

(19) RÉPUBLIQUE FRANÇAISE  
INSTITUT NATIONAL  
DE LA PROPRIÉTÉ INDUSTRIELLE  
PARIS

(11) N° de publication :  
(à n'utiliser que pour les  
commandes de reproduction)

2 785 695

(21) N° d'enregistrement national : 98 14012

(51) Int Cl<sup>7</sup> : G 06 F 17/00

(12)

## DEMANDE DE BREVET D'INVENTION

A1

(22) Date de dépôt : 06.11.98.

(30) Priorité :

(43) Date de mise à la disposition du public de la demande : 12.05.00 Bulletin 00/19.

(56) Liste des documents cités dans le rapport de recherche préliminaire : Se reporter à la fin du présent fascicule

(60) Références à d'autres documents nationaux apparentés :

(71) Demandeur(s) : BULL CP8 Société anonyme — FR et INRIA INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE — FR.

(72) Inventeur(s) : SCHULTZ ULRIK PAGH, CONSEL CHARLES, MULLER GILLES, CLAUSEN LARS et GOIRE CHRISTIAN.

(73) Titulaire(s) :

(74) Mandataire(s) : CABINET PLASSERAUD.

(54) PROCEDE DE COMPACTAGE D'UN PROGRAMME DE TYPE CODE OBJET INTERMEDIAIRE EXECUTABLE DANS UN SYSTEME EMBARQUE MUNI DE RESSOURCES DE TRAITEMENT DE DONNEES, SYSTEME COMPACTEUR ET SYSTEME EMBARQUE MULTI-APPLICATIONS CORRESPONDANTS.

(57) L'invention concerne un procédé et un système de compactage d'un programme intermédiaire.

Le programme est soumis à une recherche (1000) de séquences identiques ( $S_i$ ) et à un comptage du nombre  $N_i$  d'occurrences de chaque séquence ( $S_i$ ). Un test (1001) de comparaison de supériorité d'une fonction  $f(N_i)$  à une valeur de référence permet de créer (1003) une instruction spécifique de code spécifique ( $C_i$ ) auquel est associée la séquence ( $S_i$ ). Un remplacement de chaque occurrence de la séquence ( $S_i$ ) par le code spécifique ( $C_i$ ) est effectué (1004) dans le programme intermédiaire pour engendrer un programme intermédiaire compacté (FEX) auquel est associé un fichier d'exécution (FEX).

Application à des objets portatifs multi-applications tels que carte à microprocesseur, systèmes embarqués ou analogues.



PROCEDE DE COMPACTAGE D'UN PROGRAMME DE TYPE CODE OBJET  
INTERMEDIAIRE EXECUTABLE DANS UN SYSTEME EMBARQUE MUNI DE  
RESSOURCES DE TRAITEMENT DE DONNEES, SYSTEME COMPACTEUR  
ET SYSTEME EMBARQUE MULTI-APPLICATIONS CORRESPONDANTS

5

La présente invention est relative à un procédé de compactage d'un programme de type code objet intermédiaire, exécutable dans un système embarqué muni de ressources de traitement de données et au système compacteur correspondant.

Les systèmes embarqués munis de ressources de traitement de données actuels permettent de remplir des fonctions de plus en plus complexes et de plus en plus nombreuses, en raison de l'optimisation croissante de l'adéquation entre le matériel, constitutif de ces objets portatifs, et des logiciels, ou plus particulièrement des programmes ou applications implantés dans ces derniers, afin de leur conférer une ou plusieurs fonctionnalités spécifiques. La notion de système embarqué recouvre tout système informatique portable, tel qu'objet portatif, carte à microprocesseur ou analogue, distinct d'un micro-ordinateur classique.

C'est en particulier le cas des cartes à microprocesseur, encore appelées cartes à puce, telles que représentées en figure 1a, pour lesquelles on utilise un compilateur pour engendrer des instructions et un interpréteur permettant d'assurer l'exécution de ces instructions par le microprocesseur, ainsi que représenté en figure 1b. De manière classique, ainsi que représenté sur la figure 1a, une carte à microprocesseur 10 comprend un système d'entrée/sortie 12, relié au microprocesseur 14, une mémoire RAM 16, une mémoire non volatile 18, constituée par

une mémoire morte ROM 18b et une mémoire programmable 18a. L'ensemble de ces éléments est relié au microprocesseur 14 par une liaison par BUS. Un module 20 de chiffrement/déchiffrement de données peut, le cas échéant, 5 être prévu.

L'implantation de l'ensemble des éléments logiciels d'applications, tels que porte-monnaie électronique, commerce électronique ou santé, dans la mémoire programmable non volatile, de l'interpréteur en mémoire programmable non 10 volatile ou en mémoire morte et du système d'exploitation, en mémoire morte ROM, est représentée en figure 1c.

Le code objet intermédiaire est engendré par le compilateur à partir d'un programme source, le plus souvent écrit en langage de haut niveau, à partir des caractères 15 ASCII. Le programme source et le code objet intermédiaire correspondant peuvent être exécutés par tous les microprocesseurs usuels, car l'interpréteur assure l'adaptation logicielle des instructions standard du code objet intermédiaire en instructions directement exécutables 20 par le microprocesseur.

A titre d'exemple non limitatif, les fabricants de cartes à microprocesseur ont récemment développé des interpréteurs implantés dans la mémoire morte ROM. Ce type d'interpréteur lit de façon séquentielle un programme ou 25 code objet intermédiaire, support d'une application par exemple, chargé par exemple dans la mémoire programmable de la carte à microprocesseur. Chaque instruction standard de ce code objet intermédiaire est interprétée par l'interpréteur, puis exécutée par le microprocesseur. En 30 règle générale, les instructions standard du code objet intermédiaire permettent de traiter des fonctions évoluées telles que le traitement arithmétique et la manipulation

d'objets. La notion d'objet concerne les objets informatiques tels que des listes, tableaux de données ou analogues.

Toutefois, en raison notamment du caractère portatif de ces cartes à microprocesseur, l'encombrement et la taille de ces dernières sont limités. Il en est de même de la taille de la mémoire programmable de ces derniers, laquelle est, par construction, limitée à quelques kilo-octets. Une telle limitation structurelle ne permet pas la mise en œuvre de gros programmes d'application.

En outre, la tendance actuelle de la mise en œuvre de systèmes embarqués multi-applications trouve une limitation rédhibitoire à la multiplication du nombre d'applications installées sur un même système embarqué ou carte à microprocesseur, à un nombre excédant rarement trois applications.

La présente invention a pour objet de remédier à l'inconvénient précité par la mise en œuvre d'un procédé de compactage d'un programme de type code objet intermédiaire, utilisable dans un système embarqué de type carte à microprocesseur, afin de libérer de l'espace mémoire dans la mémoire programmable de ce système embarqué et permettre ainsi l'implantation d'au moins une application supplémentaire, après compactage de cette dernière.

Un autre objet de la présente invention est en outre la mise en œuvre d'un système de compactage de programmes de type code objet intermédiaire permettant l'implantation d'un programme de type code objet intermédiaire compacté dans un système embarqué multi-applications muni de ressources de traitement de données permettant l'exécution de programmes de type code objet intermédiaires compactés en l'absence de modification notable de la durée d'exécution, et en

transparence totale vis-à-vis du processus inhérent à chaque application non compactée.

Le procédé de compactage, objet de l'invention, d'un programme de type code objet intermédiaire consistant en une suite d'instructions standard, ce système embarqué étant doté d'une mémoire et d'un interpréteur de langage du programme de type code objet intermédiaire en instructions d'un code objet directement exécutables par un microprocesseur et ce programme étant normalement mémorisé dans la mémoire de ce système embarqué, est remarquable en ce que l'on recherche dans le programme de type code objet intermédiaire des séquences identiques d'instructions standard successives et l'on soumet les séquences identiques d'instructions standard successives à un test de comparaison de supériorité d'une fonction d'au moins le nombre d'occurrences de ces séquences dans le programme de type code objet intermédiaire à une valeur de référence. Sur réponse positive au test précédent, pour chaque séquence identique d'instructions standard successives satisfaisant à l'étape de test, on engendre une instruction spécifique par définition d'un code opératoire spécifique et association à ce code opératoire spécifique de la séquence d'instructions standard successives ayant satisfait à ce test. On remplace en outre, dans le programme de type code objet intermédiaire mémorisé, chaque occurrence de chaque séquence d'instructions standard successives par le code opératoire spécifique qui lui est associé, pour obtenir un programme de type code objet intermédiaire compacté, succession d'instructions standard et de codes opératoires spécifiques. On mémorise dans la mémoire une table de décompactage permettant la mise en correspondance biunivoque entre chaque code opératoire spécifique introduit et la séquence

d'instructions standard successives associée à ce dernier. Ce processus permet d'optimiser l'espace mémoire occupé par le programme de type code objet intermédiaire compacté par mémorisation dans la mémoire programmable d'une seule 5 occurrence des séquences identiques d'instructions standard successives.

Le procédé, le système de compactage d'un programme de type code objet intermédiaire et le système embarqué multi-applications correspondant, objets de la présente 10 invention, trouvent application dans le domaine technique des systèmes embarqués, plus particulièrement dans la mise en œuvre et la gestion de cartes à microprocesseur.

Ils seront mieux compris à la lecture de la description et à l'observation des dessins ci-après dans 15 lesquels, outre les figures 1a à 1c relatives à l'art antérieur,

- la figure 2a représente un organigramme général illustratif d'un procédé de compactage d'un programme de type code objet intermédiaire, selon la présente invention ;
- 20 - la figure 2b représente un schéma synoptique illustratif de la mise en œuvre des différents opérateurs nécessaires à l'obtention d'un programme de type code objet intermédiaire compacté et de paramètres permettant le décompactage ou l'exécution de ce programme ;
- 25 - la figure 2c représente, à titre purement illustratif, l'implantation en mémoire programmable, non volatile, d'une carte à microprocesseur de ce programme de type code objet intermédiaire compacté et des paramètres d'exécution ou décompactage de ce dernier ;
- 30 - la figure 3a représente, dans un mode de réalisation particulier non limitatif, un schéma illustratif de la structure d'un premier fichier constitutif de ces

paramètres d'exécution ou décompactage de ce programme de type code objet intermédiaire compacté ;

5 - la figure 3b représente, dans un mode de réalisation particulier non limitatif, un schéma illustratif de la structure d'un deuxième fichier constitutif de ces paramètres d'exécution ou décompactage de ce programme de type code objet intermédiaire compacté ;

10 - la figure 4 représente, à titre illustratif, l'implantation en mémoire programmable non volatile d'un programme de type code objet intermédiaire compacté, selon la présente invention, dans une carte à microprocesseur ou système embarqué multi-applications ;

15 - la figure 5 représente, à titre illustratif, un processus de mise en œuvre spécifique du procédé de compactage d'un programme de type code objet intermédiaire dans lequel une actualisation des codes spécifiques relatifs à des applications ou programmes de type code objet intermédiaires distincts est réalisé ;

20 - les figures 6a et 6b représentent, sous forme d'éléments fonctionnels, un système de compactage d'un programme de type code objet intermédiaire conforme à l'objet de la présente invention.

25 Le procédé de compactage d'un programme de type code objet intermédiaire, conforme à l'objet de la présente invention, sera maintenant décrit en liaison avec la figure 2a. La désignation programme type code objet intermédiaire recouvre tout programme intermédiaire dans la présente demande de brevet.

30 Ce procédé sera décrit de manière non limitative dans le cas de sa mise en œuvre dans un système embarqué constitué par exemple par une carte à microprocesseur telle que représentée en figure 1a, ce programme de type code

objet intermédiaire étant obtenu de manière classique, ainsi que représenté en figure 1b, et l'implantation en mémoire programmable d'une pluralité d'applications de l'interpréteur et du système d'exploitation OS en mémoire ROM étant représentée en figure 1c, de manière non limitative.

Le programme de type code objet intermédiaire consiste en une suite d'instructions standard exécutables par le microprocesseur par l'intermédiaire de l'interpréteur.

Le procédé de compactage d'un tel programme consiste, préalablement à l'implantation de ce dernier en mémoire programmable 18a, à effectuer, ainsi que représenté en figure 2a, en une étape 1000, une recherche dans le programme de type code objet intermédiaire des séquences identiques d'instructions standard successives, ces séquences identiques étant notées  $S_i$ . Par séquences identiques, on indique une suite d'un nombre  $n$  d'octets déterminé susceptible d'apparaître de manière répétitive dans le programme de type code objet intermédiaire précité. Ainsi, le rang  $i$  des séquences identiques indique, pour des valeurs de  $i$  différentes, des séquences distinctes. En outre, l'étape 1000 de recherche précitée consiste à déterminer le nombre d'occurrences  $N_i$  de chaque séquence identique  $S_i$  précitée. A l'issue de l'étape 1000 de recherche, on dispose d'une pluralité de séquences identiques  $S_i$ , chaque séquence  $S_i$  étant distincte, et d'un nombre  $N_i$  représentant le nombre d'occurrences dans le programme de type code objet intermédiaire de chacune des séquences  $S_i$ .

Suite à l'étape 1000 précitée, le procédé de compactage, objet de la présente invention, consiste à

soumettre, en une étape 1001, les séquences identiques d'instructions standard successives  $S_i$  à un test de comparaison d'une fonction  $f(N_i)$  d'au moins le nombre d'occurrences  $N_i$  associé à une séquence identique  $S_i$ . Sur la 5 figure 2a, le test de comparaison est noté :

$$f(N_i) > \text{Ref.}$$

Lorsque la réponse au test 1001 est négative, la fonction d'au moins le nombre d'occurrences  $N_i$  n'étant pas supérieure à la valeur de référence, le test 1001 est 10 appliqué à la séquence identique suivante, de rang  $i+1$ , par incrémentation de l'indice  $i$  à l'étape 1002.

Les étapes 1000, 1001 et 1002 représentées en figure 2a permettent ainsi de rechercher dans le programme de type code objet intermédiaire l'ensemble des séquences ou séries 15 d'octets identiques ou, à tout le moins, un nombre significatif donné de ces séquences identiques, ainsi qu'il sera décrit ultérieurement dans la description.

Sur réponse positive au test 1001 précité, le procédé de compactage, objet de la présente invention, 20 consiste ensuite à engendrer une instruction spécifique, notée  $IS_i$ , par définition d'un code opératoire spécifique, noté  $C_i$ , et association à ce code opératoire spécifique de la séquence d'instructions standard successives ayant satisfait au test, la séquence d'instructions standard 25 successives  $S_i$ . Sur la figure 2a, l'étape de création d'instructions spécifiques est notée :

$$IS_i = C_i:S_i.$$

On indique que l'étape de définition d'un code opératoire spécifique et d'association à ce dernier de la séquence 30 d'instructions standard successives  $S_i$  peut consister en l'attribution d'une valeur de code et l'association sous

forme d'une liste par exemple de cette valeur de code et de la séquence d'instructions  $S_i$  précitée.

Suite à l'étape 1003, le procédé de compactage consiste ensuite, à l'étape 1004, à remplacer dans le programme de type code objet intermédiaire mémorisé chaque occurrence de la séquence d'instructions successives standard  $S_i$  par le code opératoire spécifique  $C_i$  qui lui est associé pour obtenir un programme de type code objet compacté, noté FCC, succession d'instructions standard et de codes opératoires spécifiques  $C_i$ .

Le processus de remplacement peut alors être réitéré pour chaque séquence ou série d'instructions standard identiques  $S_i$  tant que l'indice  $i$  est inférieur à un nombre  $P$  de séquences identiques, un test 1005 de comparaison de l'indice  $i$  à la valeur  $P$  permettant, sur réponse positive à ce test, le retour à l'étape d'incrémentation 1002 de l'indice  $i$  précédemment décrit.

On comprend en particulier que, suite à l'itération du processus de remplacement ainsi formé, on obtient un programme de type code objet compacté, noté FCC, auquel est associé un fichier d'exécution de ce dernier, fichier noté FEX, ce fichier d'exécution consistant au moins en une mise en correspondance biunivoque entre chaque code spécifique  $C_i$  et la séquence d'instructions standard successives  $S_i$  précitée.

Suite à l'obtention des deux fichiers précités, programme de type code objet intermédiaire compacté et fichier d'exécution, sur réponse négative au test 1005 par exemple, il est possible de procéder à une mémorisation, dans la mémoire programmable 18a par exemple, du programme de type code objet intermédiaire compacté obtenu FCC précité, et bien entendu du fichier d'exécution FEX.

précédemment mentionné. La mémorisation précitée peut de manière non limitative être effectuée dans la mémoire non volatile 18, mémoire programmable 18a ou même mémoire morte 18b.

5 En ce qui concerne le test de comparaison 1001 précité, on indique bien entendu que la fonction d'au moins le nombre d'occurrences de chaque séquence identique Si peut être définie de façon à obtenir une optimisation du gain de compactage ainsi réalisé. Dans un mode de réalisation non 10 limitatif, on indique que cette fonction peut être établie de façon à réaliser une comparaison de la taille de chaque séquence identique d'instructions standard successives en nombre d'octets à une valeur de seuil, exprimée par exemple en nombre d'instructions standard.

15 La figure 2b décrit, à titre d'exemple illustratif, un mode opératoire permettant d'engendrer un programme de type code objet intermédiaire compacté conformément à la mise en œuvre du procédé, objet de la présente invention.

Dans un premier temps, le créateur du programme de 20 type code objet intermédiaire réalise un fichier de type texte contenant le programme source. Ce programme établi par ce dernier à partir d'un langage évolué est, de manière générale, écrit en code ASCII de manière à être lu facilement et à pouvoir contenir des commentaires qui 25 facilitent, d'une part, la compréhension, et d'autre part, la mise au point de ce dernier. Le programme source ainsi obtenu est introduit dans un compilateur de type classique, dit compilateur standard, dont le rôle consiste à transformer chaque ligne de programme en instructions 30 exécutables ou, à tout le moins, en instructions interprétables pour obtenir un programme de type code objet

intermédiaire consistant en une suite d'instructions standard interprétables par l'interpréteur.

Le fichier de type code objet intermédiaire ainsi obtenu suite au processus de compilation est introduit dans 5 un système compacteur permettant la mise en œuvre du procédé de compactage précédemment décrit en liaison avec la figure 1. Ce système compacteur sera décrit ultérieurement dans la description.

Le processus de compactage mis en œuvre, ainsi que 10 décrit précédemment, permet alors l'obtention d'un fichier d'instructions interprétables FCC, c'est-à-dire du fichier constitutif du programme de type code objet intermédiaire compacté, et du fichier d'exécution FEX précédemment mentionné dans la description.

15 Le mode opératoire du système de compactage sera décrit ci-après dans un exemple spécifique de mise en œuvre.

En premier lieu, le système compacteur analyse toutes les instructions standard  $I_s$  et dresse une liste de toutes les séries d'instructions standard existant dans le 20 fichier constitutif de ce dernier.

Si le fichier précité contient 1000 octets par exemple, le système compacteur lance une procédure de recherche de toutes les séries d'au moins deux octets jusqu'à un nombre  $Q$  par exemple. La recherche précitée peut 25 être effectuée pour des séries de deux octets, puis de trois octets, et ainsi de suite jusqu'à  $Q$  octets. Dans un mode de réalisation préférentiel, le nombre  $Q$  avait la valeur 500.

Ainsi, pour chaque séquence d'instructions  $S_i$ , formée par une série d'instructions standard  $I_s$ , le système 30 compacteur détermine si cette séquence  $S_i$  est déjà dans la liste. Dans un tel cas, le système compacteur rajoute une unité au nombre d'occurrences  $N_i$  de la séquence  $S_i$  précitée.

A la fin du processus de recherche précité, le système compacteur a ainsi engendré une liste complexe contenant l'ensemble des séquences d'instructions  $S_i$  examinées, à chaque séquence étant associé un nombre 5 d'occurrences  $N_i$  dans le programme de type code objet intermédiaire considéré.

Un tableau illustratif est donné ci-après pour un programme de type code objet intermédiaire constitué par la série d'instructions ci-après :

10 1-7-3-5-7-3-7-3-5-7.

Alors que pour l'exemple illustratif donné dans le tableau, TABLEAU 1 ci-après, la série d'instructions précitée comporte dix instructions, chaque instruction étant représentée par un octet et illustrée par un chiffre de 1 à 15 7, les séquences d'instructions successives examinées comprennent 2, 3, 4 puis 5 octets.

Les séquences d'instructions successives  $S_i$ , dont le nombre d'occurrences dans le programme de type code objet intermédiaire précité est supérieur ou égal à deux, sont 20 données dans le tableau ci-après.

TABLEAU 1

|          |             |           |         |
|----------|-------------|-----------|---------|
| 4 octets | [7-3-5-7]:2 |           |         |
| 3 octets | [7-3-5]:2   | [3-5-7]:2 |         |
| 2 octets | [7-3]:3     | [3-5]:2   | [5-7]:2 |

En deuxième lieu, le système compacteur remplace certaines séquences  $S_i$  du TABLEAU 1 par un code 25 d'instructions spécifiques.

Le code d'instructions spécifiques  $C_i$  est déterminé chronologiquement à partir du premier code correspondant à

une instruction standard. Dans un code intermédiaire objet courant, il existe à ce jour 106 instructions standard et les codes de ces instructions sont compris entre 000 et 105. Le premier code d'instructions spécifiques  $C_i$  peut alors 5 être la valeur 106, le second la valeur 107 et ainsi de suite. Chaque fois que les séquences d'instructions identiques  $S_i$  sont remplacées par un nouveau code d'instructions spécifiques  $C_i$ , une fois qu'une telle opération est terminée, la liste représentée dans le tableau 10 précédent est alors recalculée.

A titre d'exemple non limitatif et dans le cas du remplacement de la séquence d'instructions de 4 octets représentée au tableau précédent, la séquence 7-3-5-7, et allocation d'un code spécifique correspondant 106, le 15 programme de type code objet intermédiaire compacté devient :

1-106-3-106.

Dans ces conditions, il n'existe plus de séquence d'instructions standard  $I_s$ , et d'instructions spécifiques  $IS$  20 se retrouvant à l'identique au moins deux fois. Bien entendu, le fichier constitutif du programme de type code objet intermédiaire compressé FCC et le fichier d'exécution ou de décompactage de ce dernier sont mémorisés au niveau du système compacteur précité.

Après l'opération de compactage réalisée par le système compacteur, on dispose du programme de type code objet intermédiaire proprement dit, exécutable par le système cible, et du fichier d'exécution FEX précité. Le premier précité contient des instructions standard  $I_s$  et des 25 instructions spécifiques  $IS$ , alors que le second comporte au moins un tableau permettant de lier les codes spécifiques  $C_i$  avec les séries d'instructions standard  $S_i$  remplacées par

les codes spécifiques précités. Bien entendu, ces deux fichiers peuvent être regroupés en un seul et même fichier en vue du transfert de ce dernier au système cible destinataire, c'est-à-dire à la carte à microprocesseur 5 destinée à recevoir ce dernier.

En ce qui concerne le fichier d'exécution FEX, on indique que celui-ci comporte au moins un fichier, noté MEM-SEQ, constitué par une succession de plusieurs champs tels qu'un champ de code spécifique  $C_i$ , un champ de séquence 10  $S_i$ , tel que mentionné précédemment.

Suite à l'opération précitée, le fichier unique ou, le cas échéant, les deux fichiers précités, sont transmis au système cible et directement traités par un programme de chargement. Ce programme de chargement est principalement 15 chargé d'écrire en mémoire programmable 18a ou en mémoire morte 18b les données reçues en vue d'une bonne exécution par la suite.

A titre d'exemple non limitatif, on indique que le fichier relatif au programme de type code objet 20 intermédiaire compacté FCC est stocké sans traitement à partir d'une adresse déterminée, notée ADR-MEM-PGM, dans la mémoire programmable 18a précitée.

En ce qui concerne le fichier d'exécution FEX, on indique que vis-à-vis de ce dernier, le programme de 25 chargement analyse les données de ce fichier et crée dynamiquement un tableau, noté TAB-PRO, permettant d'associer les codes d'instructions spécifiques  $C_i$  avec les séries d'instructions. En fait, le tableau TAB-PRO permet d'assurer une correspondance biunivoque entre les codes 30 d'instructions spécifiques précités  $C_i$  et une adresse d'implantation permettant l'exécution des instructions correspondantes.

Une implantation, d'une part, du fichier support du programme de type code objet intermédiaire compacté FCC, du fichier d'exécution FEX et du fichier TAB-PRO précédemment cité, ce dernier fichier ayant été engendré par le programme 5 de chargement dans la mémoire programmable 18a de la carte à microprocesseur, est représentée en figure 2c.

Sur cette figure, alors que le tableau des codes d'instructions standard  $I_s$  est mémorisé au niveau de l'interpréteur en un tableau TAB-STD, le fichier d'exécution 10 FEX et le fichier TAB-PRO permettant d'assurer la correspondance des sauts d'adresse avec les codes d'instructions spécifiques  $C_i$ , ces deux tableaux permettant l'exécution effective au niveau du microprocesseur de l'unité cible du programme de type code objet intermédiaire 15 compacté FCC, sont au contraire mémorisés dans la mémoire programmable 18a. On dispose ainsi d'un ensemble exécutable par l'intermédiaire de l'interpréteur dans les conditions qui seront décrites ci-après.

Préalablement à la description de l'exécution d'un 20 programme de type code objet intermédiaire compacté FCC, une description détaillée de la structure des fichiers d'exécution FEX et du fichier TAB-PRO et de la relation fonctionnelle entre ces derniers sera maintenant donnée en liaison avec les figures 3a et 3b.

25 Sur la figure 3a, on a représenté le fichier d'exécution FEX de manière détaillée, celui-ci comportant ainsi que mentionné précédemment, outre les champs de codes spécifiques  $C_i$  et de séquences d'instructions  $S_i$ , un champ de fin de macro-instructions, noté FM, indiquant en fait la 30 fin de la séquence précitée. Dans un mode de réalisation non limitatif, chaque code spécifique  $C_i$  peut être inscrit au début du champ, sur un octet par exemple, puis chaque

séquence correspondante  $S_i$  est inscrite dans un second champ de longueur variable. Le code de fin de macro FM est de type standard et correspond à celui utilisé par le langage classique précédemment indiqué dans la description.

5 Lors de la réception du fichier d'exécution FEX dont la structure de données correspond à celle représentée en figure 3a par exemple, les différents champs  $C_i$ ,  $S_i$  et FM sont traités séparément.

En premier lieu, le code spécifique  $C_i$  de 10 l'instruction spécifique IS correspondante est écrit dans le fichier TAB-PRO et la séquence d'instructions  $S_i$  associée à ce code spécifique constitutive de l'instruction spécifique précitée est écrite dans un fichier ou mémoire référencée MEM-SEQ à partir d'une adresse notée ADR-1. Le code  $C_i$  de 15 l'instruction spécifique correspondante est écrit à l'adresse TAB-PRO + 3 x (CODE-106). Dans cette relation, on indique que l'adresse TAB-PRO est l'adresse d'ouverture du fichier TAB-PRO, alors que la valeur CODE représente la valeur numérique du code  $C_i$  correspondant. Sur la figure 3b, 20 on a représenté le mode opératoire correspondant pour une valeur d'adresse TAB-PRO égale arbitrairement à 0, le premier code spécifique alloué ayant la valeur 106 et les autres codes spécifiques alloués successifs ayant des valeurs 107 et suivantes. On indique que sur la figure 3b 25 seuls quatre codes spécifiques 106, 107, 110 et 120 ont été représentés pour une meilleure compréhension, les autres espaces mémoire étant remplis par des valeurs arbitraires.

Dans ces conditions, Adr-i est la première adresse 30 disponible dans la mémoire MEM-SEQ, cette adresse correspondant à l'adresse Adr-1 pour la première séquence d'instructions  $S_i = S_1$ . A partir de cette première adresse, laquelle constitue l'adresse d'ouverture du fichier dans la

mémoire MEM-SEQ, les séquences d'instructions  $S_i$  sont ainsi écrites de façon séquentielle dans l'ordre de leur chargement. Le code FM de fin de macro est également écrit à la fin de la série correspondante.

5 A la suite de l'écriture précitée dans la mémoire MEM-SEQ et après une étape de vérification correcte d'écriture, le programme de chargement écrit dans le tableau TAB-PRO à la suite de chaque code spécifique  $C_i$  la valeur de l'adresse d'écriture de la séquence dans la mémoire MEM-SEQ.

10 Le programme de chargement recalcule alors une nouvelle adresse d'écriture pour la prochaine séquence  $S_i$  de rang  $i$  incrémenté ou décrémenté en fonction du mode de parcours des séquences d'instructions  $S_i$  précitées.

Un processus d'exécution d'un programme de type code objet intermédiaire compacté supporté par un fichier FCC précédemment décrit et contenant des instructions spécifiques sera maintenant décrit en référence à la figure 4.

L'exécution d'un tel programme s'effectue par l'intermédiaire de l'interpréteur à l'aide d'un pointeur d'instruction, noté PI. En fait, le pointeur d'instruction PI lit le code de l'instruction à exécuter, instruction standard Is ou instruction spécifique IS, et présente ce code à l'interpréteur qui déclenche ensuite les actions correspondant à ce dernier.

Au début de l'exécution d'un programme, le pointeur d'instruction PI est chargé avec l'adresse de début de ce programme, c'est-à-dire l'adresse ADR-MEM-PGM.

L'interpréteur analyse la valeur du code lu par le pointeur d'instruction PI. Dans le cadre de cette analyse, ce dernier détermine si cette valeur de code correspond à un code de type standard Cs ou au contraire à un code de type

spécifique  $C_i$ . Cette opération est réalisée à partir du tableau TAB-STB mémorisé au niveau de l'interpréteur et associant les codes d'instructions standard, et donc les instructions standard  $I_s$ , avec les adresses d'exécution dans 5 son programme.

Si la valeur du code lu n'est pas dans ce dernier tableau, l'interpréteur provoque un appel en lecture dans le tableau TAB-PRO afin de vérifier l'existence de la valeur du code lu dans ce dernier tableau. Si le code lu n'est pas non 10 plus dans ce dernier tableau, l'interpréteur est incapable d'exécuter l'instruction lue et l'exécution du programme s'arrête en indiquant un message d'erreur, non décrit dans 15 l'organigramme de la figure 4.

Sur la figure 4 précitée, on a représenté par 2000 20 le début de l'opération d'exécution, 2001 l'opération d'initialisation du pointeur d'instruction PI à la première instruction du programme et 2002 une opération de lecture de l'instruction pointée par le pointeur d'instruction PI. 25 Cette opération correspond en fait à la lecture de la valeur de code précitée.

De la même manière, à l'étape 2003 de la figure 4, l'appartenance ou la non-appartenance de la valeur de code lu au tableau des codes standard TAB-STB et l'appartenance 25 de cette valeur de code lu au tableau TAB-PRO permet en fait de constituer le test 2003 précité, l'instruction lue  $I_s$  étant ainsi discriminée en qualité d'instruction standard  $I_s$  ou instruction spécifique  $I_s$ . La situation d'absence d'appartenance du code lu et de l'instruction lue correspondante à l'un et l'autre des deux tableaux 30 génératrice d'un message d'erreur n'est pas représentée en figure 4, afin de ne pas surcharger le dessin.

Si, sur réponse positive au test 2003 précité, le code lu correspond à une instruction spécifique, la valeur du pointeur d'instruction PI, pour lire l'instruction suivante, est calculée et mémorisée dans la pile.

5 L'interpréteur lit dans le tableau TAB-PRO la valeur de l'adresse de la séquence d'instructions Si associée au code spécifique Ci lu et initialise la valeur du pointeur d'instruction PI avec cette valeur. L'ensemble de ces opérations porte la référence 2004 sur la figure 4 précitée.

10 A la suite de l'étape 2004 précitée, l'interpréteur boucle de nouveau à l'étape lecture du code, ainsi que représenté en figure 4, par retour à l'étape 2002.

Si, sur réponse négative au test 2003, le code lu correspond à une instruction de type standard Is, l'interpréteur contrôle dans une étape de test 2005 si la valeur de ce code correspond à une valeur de fin de macro représentant en fait une fin de séquence. Si tel est le cas, la valeur précédemment mémorisée dans la mémoire de pile est extraite et la pile est mise à jour, cette valeur étant chargée dans le pointeur d'instruction PI. L'opération d'extraction de la pile de la valeur précédemment mémorisée constituant une adresse de retour puis de remise à jour de la pile, est représentée en 2006, l'adresse de retour étant notée ADR-RET. Suite à l'étape 2006 précitée, l'interpréteur boucle de nouveau le processus à l'étape de lecture de la valeur de code, c'est-à-dire à l'étape 2002. Si, sur réponse négative au test 2005, la valeur du code lu correspondant à une instruction de type standard ne correspond toutefois pas à une fin de macro ou fin de série, alors, le code est exécuté de façon connue en tant que telle par l'interpréteur. Ainsi qu'on l'a toutefois représenté en figure 4, une étape de test 2007 est prévue dans ce cas

préalablement à l'exécution proprement dite de l'instruction standard précitée. Le test 2007 consiste à vérifier que la valeur du code et l'instruction INS correspondante ne correspond pas à celle d'une fin de programme. Sur réponse 5 positive au test 2007 précité, l'étape d'exécution 2008 de cette instruction par l'interpréteur est alors réalisée, à cette étape d'exécution étant associée une étape d'incrémantation du pointeur d'instruction vers l'instruction suivante. Suite à l'étape 2008 précitée, 10 l'interpréteur reboucle vers l'étape de lecture de la valeur de code pointée par le pointeur d'instruction PI, c'est-à-dire l'étape de lecture 2002.

Sur réponse négative au test 2007, l'instruction correspondant à une instruction de fin de programme, une 15 étape de fin 2009 est réalisée. L'interpréteur dans ce cas arrête son action et donne la main au système d'exploitation OS. Celui-ci attend alors une nouvelle instruction de commande.

Le mode de réalisation et de mise en œuvre du 20 processus d'exécution d'un programme de type code objet intermédiaire compacté, tel que décrit précédemment en liaison avec la figure 4, n'est pas limitatif.

En premier lieu, on indique que la mémoire de pile peut être subdivisée en deux mémoires de pile séparées, une 25 mémoire de pile pour les instructions standard Is, et une mémoire de pile pour les instructions spécifiques IS encore désignées par macro-instructions. Dans un tel mode de réalisation, on connaît le nombre maximal d'imbriques d'instructions spécifiques IS intraprocéduralement. Pour 30 avoir la taille totale occupée par cette pile, il suffit de multiplier par le nombre maximal de procédures imbriquées. La mise en œuvre d'une mémoire de pile séparée pour les

instructions spécifiques IS procure, par rapport à l'utilisation d'une seule pile, une réduction de la consommation totale de mémoire.

En outre, afin d'augmenter le nombre d'instructions spécifiques IS utilisables en lieu et place du nombre d'instructions spécifiques limité entre 106 et 255 dans l'exemple précédemment donné dans la description, les codes spécifiques  $C_i$  peuvent avantageusement être codés sur deux octets. Dans ces conditions, une valeur de code particulière, telle que la valeur 255, peut alors indiquer le codage sur deux octets.

Enfin, le système cible, lorsque ce dernier est constitué par un système embarqué multi-applications, comprend plusieurs programmes compilés et compactés, c'est-à-dire plusieurs fichiers FCC précédemment décrits dans la description. Ces programmes doivent fonctionner de manière indépendante. Dans un tel cas, l'interpréteur étant unique, il exécute tous les programmes d'applications chargés par le programme de chargement. Si deux programmes d'applications utilisent des instructions spécifiques, dans le mode de réalisation précédemment décrit dans la description, il est possible que le système compacteur affecte le même code spécifique  $C_i$  pour deux séries d'instructions différentes.

Afin de remédier à une telle situation et pour permettre à l'interpréteur de distinguer les deux codes, les champs du fichier d'exécution FEX tels que représentés précédemment en figure 3a peuvent être complétés par un troisième paramètre relatif à un numéro d'identification de l'application considérée. Ce numéro d'identification est alors mémorisé également pour chaque code spécifique affecté dans le tableau TAB-PRO. Ce dernier paramètre constitue en fait la référence du programme chargé en même temps que le

fichier contenant le tableau permettant d'associer chaque code d'instruction spécifique  $C_i$  avec des séquences d'instructions  $S_i$  remplacées par ces derniers pour l'application considérée. Lors de l'exécution de 5 l'application du programme par l'interpréteur, ce dernier peut ainsi assurer la discrimination des instructions spécifiques relatives à cette application.

Bien entendu, le processus précédemment décrit permettant la mise en œuvre d'un système embarqué multi- 10 applications présente l'inconvénient d'une consommation accrue de mémoire, du fait de l'attribution d'un champ supplémentaire relatif au numéro d'application considérée.

Un processus plus avantageux sera maintenant décrit en liaison avec la figure 5.

15 Relativement à la figure 5, on considère un système embarqué tel qu'une carte à microprocesseur comportant plusieurs applications, notées  $A_1$  à  $A_k$ , les valeurs  $A_1$  à  $A_k$  constituant en fait des numéros d'identification de chaque application. Dans ce but, lors du compactage, conformément 20 au procédé objet de la présente invention tel que décrit précédemment dans la description, de tout programme ou application source de numéro d'identification donné à  $A_1$  à  $A_{k-1}$  par exemple, le système cible, c'est-à-dire carte à microprocesseur, transmet au compacteur le contenu de la 25 mémoire MEM-SEQ avec bien entendu les codes spécifiques  $C_i$  correspondants. En fait, le système cible recalcule à partir du fichier ou tableau TAB-PRO et du contenu de la mémoire MEM-SEQ un fichier des coefficients spécifiques antérieurs, noté F-C-ANT, relatif aux applications  $A_1$  à  $A_{k-1}$ . Le fichier 30 F-C-ANT assure la mise en correspondance biunivoque de chaque code spécifique  $C_i$  et de la séquence  $S_i$  associée à ce dernier pour l'ensemble des applications  $A_1$  à  $A_{k-1}$ . Dans ces

conditions et dans un mode de réalisation non limitatif simplifié, le fichier F-C-ANT peut consister en un fichier de même format que le fichier FEX précité. Dans le processus de compactage préférentiel tel que représenté en figure 5, 5 le fichier F-C-ANT des codes spécifiques antérieurs est alors communiqué au compacteur afin d'assurer un apprentissage de ce dernier.

Lors du compactage d'une nouvelle application, de numéro d'identification  $A_k$ , le compacteur recherche toutes 10 les occurrences des séquences d'instructions  $S_i$  déjà enregistrées dans le fichier F-C-ANT, c'est-à-dire en fait dans le tableau TAB-PRO du système cible pour les applications antérieures  $A_1$  à  $A_{k-1}$ . A chaque occurrence trouvée, le système compacteur remplace la séquence 15 d'instructions correspondante  $S_i$  par le code spécifique  $C_i$  de l'instruction spécifique IS correspondante. Cette opération étant effectuée, le système compacteur peut alors analyser l'application de code d'identification  $A_k$  et bien entendu rechercher d'autres occurrences en vue de créer des 20 instructions spécifiques supplémentaires qui n'ont pas encore été mémorisées. Une mise à jour du fichier F-C-ANT peut alors être effectuée. Le processus de décompactage décrit en liaison avec la figure 5 peut être mis en œuvre de manière particulièrement avantageuse pour assurer le 25 compactage, soit de programmes chargés pour la première fois dans le système embarqué, soit de programmes chargés en supplément à d'autres programmes compactés existants dans le système embarqué.

Dans les deux hypothèses précitées, le procédé de 30 compactage, objet de l'invention, consiste à mémoriser la table d'exécution relative à au moins un programme intermédiaire de type code objet compacté, le premier de ces

programmes dans la première hypothèse et un ou plusieurs programmes compactés existants dans la deuxième hypothèse, puis pour tout programme intermédiaire supplémentaire, à lire la table d'exécution mémorisée et à effectuer le  
5 compactage de tout programme supplémentaire, en tenant compte des instructions et codes spécifiques mémorisés dans la table d'exécution, ainsi que décrit précédemment dans la description. Bien entendu, le programme de type code objet intermédiaire compacté ainsi créé ne peut alors être exécuté  
10 que sur le système cible qui a fourni précédemment au système compacteur le fichier F-C-ANT pertinent correspondant.

Dans le cadre de la mise en œuvre du procédé de compactage d'un programme de type code objet intermédiaire,  
15 tout système embarqué, tel qu'un objet portatif multi-applications formé par exemple par une carte à microprocesseur et comportant des ressources de calcul tel qu'un microprocesseur, une mémoire programmable, une mémoire morte et un interpréteur de langage, comprend, en référence  
20 avec la figure 2c précédemment introduite dans la description, au moins, outre le tableau TAB-STD des codes standard constitutifs d'un programme de type code objet intermédiaire mémorisé au niveau de l'interpréteur, un ensemble de fichiers mémorisés dans la mémoire programmable  
25 18a par exemple.

Ainsi, l'objet portatif correspondant comprend au moins un programme de type code objet intermédiaire compacté, c'est-à-dire le fichier FCC représenté en figure 2c. Ce fichier peut être constitutif d'une application telle  
30 que mentionnée précédemment, soit d'une fonction telle qu'une fonction de chiffrement/déchiffrement de données ou analogue. Ce fichier de type code objet intermédiaire

compacté consiste bien entendu en une suite de codes d'instructions spécifiques  $C_i$  et de codes d'instructions standard correspondant aux codes d'instructions du programme de type code objet intermédiaire précité. Les codes 5 d'instructions spécifiques  $C_i$  correspondent à des séquences d'instructions standard successives  $S_i$  précédemment mentionnées dans la description.

En outre, ainsi que représenté sur la figure 2c précitée, une table d'exécution permet la mise en 10 correspondance biunivoque entre chaque code opératoire spécifique  $C_i$  et la séquence d'instructions standard successives  $S_i$  associée à ce dernier. L'ensemble de ces fichiers permet d'optimiser l'espace mémoire occupé dans la mémoire, notamment la mémoire programmable 18a, de l'objet 15 portatif.

Ainsi que représenté d'ailleurs en figure 2c, la table d'exécution comprend au moins un fichier des séquences successives correspondant aux instructions spécifiques, fichier désigné par la mémoire MEM-SEQ, et un tableau, 20 désigné par TAB-PRO des codes d'instructions spécifiques et des adresses d'implantation de ces instructions spécifiques dans le fichier des séquences successives.

L'exécution du programme de type code objet intermédiaire compacté est alors réalisée, ainsi que 25 représenté en figure 4.

Un système de compactage d'un programme de type code objet intermédiaire permettant la mise en œuvre du procédé de compactage précédemment décrit dans la description sera maintenant donné en liaison avec les figures 6a et 6b.

30 D'une manière générale, le système de compactage, objet de la présente invention, sera décrit comme une combinaison de modules, ces modules pouvant être mis en

œuvre, soit de manière matérielle, soit, préférentiellement, de manière logicielle, les flux de données entre ces modules étant représentés.

Ainsi, sur la figure 6a, on a représenté le système de compactage, objet de la présente invention, lequel est réputé comprendre au moins un module A d'analyse de toutes les instructions directement exécutables, constitutives du programme de type code objet intermédiaire, noté COD-OBJ-INT. D'une manière générale, le fichier informatique support du programme de type code objet intermédiaire précité est considéré comme une chaîne d'octets, ou chaîne de caractères, et le mode opératoire du système de compactage, objet de la présente invention, sera donné dans une optique de traitement de chaîne correspondant.

A partir de la chaîne d'octets précitée, le module d'analyse A permet, par lecture du programme de type code objet COD-OBJ-INT, de discriminer et établir une liste de toutes les séquences d'instructions standard  $S_i$  contenues dans le programme précité. Sur la figure 6a, les séquences d'instructions standard  $S_1, S_{i-1}, S_i, S_{i+1}, \dots, S_p$ , sont ainsi notées sous forme symbolique d'une liste selon la notation symbolique des listes. On comprend ainsi que le module d'analyse A peut consister en une fenêtre glissante correspondant à un nombre  $n_i$  d'octets, cette fenêtre glissante permettant d'assurer l'analyse des séquences  $S_i$  ainsi que précédemment mentionnées en référence avec le tableau 1 de la description. La fenêtre glissante assure en fait une discrimination de chaque séquence  $S_i$  par défilement relatif de la chaîne d'octets vis-à-vis de la fenêtre précitée. A chaque occurrence de la séquence  $S_i$  considérée, un bit de comptage BC est délivré par le module d'analyse A.

Ainsi que représenté en outre en figure 6a, le système de compactage, objet de la présente invention, comprend un module C de comptage du nombre d'occurrences dans le programme de type code objet précité de chacune des 5 séquences d'instructions directement exécutables  $S_i$  précédemment mentionnées. Le module de comptage C peut être réalisé par un module logiciel, lequel compte le nombre de bits successifs à la valeur 1 du bit de comptage BC précité. Le module de comptage C permet de mémoriser les nombres 10 d'occurrences  $N_1 \dots N_{i-1}$ ,  $N_i$ ,  $N_{i+1} \dots N_p$  de chaque séquence  $S_1 \dots S_{i-1}$  à  $S_{i+1} \dots S_p$  correspondante et suivante. Cette mémorisation peut être effectuée sous forme d'une liste.

En outre, ainsi que représenté sur la figure 6a, un module AL d'allocation à au moins une séquence 15 d'instructions directement exécutables  $S_i$  d'un code spécifique  $C_i$  associé à cette séquence  $S_i$  est prévu pour engendrer une instruction spécifique, notée  $IS_i$  sur la figure 6a, sur critère de supériorité de la fonction d'au moins le nombre  $N_i$  d'occurrences correspondant vis-à-vis 20 d'une valeur de référence ainsi que mentionné précédemment dans la description.

Dans le cas où la fonction d'au moins le nombre  $N_i$  est supérieure à la valeur de la fonction de la valeur de référence précitée, le module AL délivre une commande de 25 compactage COM-COMP, lequel peut consister en un bit à la valeur 1 ou 0 correspondante.

Enfin, le système de compactage, objet de la présente invention, comprend un module de compactage proprement dit COMP, lequel reçoit, d'une part, le fichier 30 relatif au programme de type code objet intermédiaire précité COD-OBJ-INT et la commande de comptage COM-COMP. Le module de compactage proprement dit COMP permet en fait

d'assurer le remplacement dans le programme de type code objet précité, considéré comme une chaîne d'octets, de chaque occurrence de toute séquence  $S_i$  correspondant à une instruction spécifique  $IS_i$  par le code spécifique  $C_i$  associé 5 à cette séquence d'instructions.

En ce qui concerne le mode opératoire du module de compactage COMP proprement dit, on indique que celui-ci peut comprendre un sous-module de lecture par fenêtre glissante analogue à celui du module d'analyse, permettant de 10 localiser la séquence d'instructions standard  $S_i$  dans la chaîne d'octets précitée. En pratique, sur localisation de la séquence d'instructions standard  $S_i$  précitée, ainsi que représenté de manière illustrative en figure 6a, le module de compactage peut comprendre un sous-module de partition à 15 gauche et de partition à droite de la séquence  $S_i$  considérée, pour engendrer une chaîne gauche, notée LS, et une chaîne droite, notée RS. Il peut comporter ensuite, à partir du code spécifique  $C_i$  constitutif de l'instruction spécifique  $IS_i$ , un module de concaténation permettant, d'une 20 part, la concaténation du code spécifique  $C_i$  correspondant, considéré comme une chaîne d'octets, à la chaîne gauche LS par exemple, puis concaténation de l'ensemble ainsi formé à la chaîne droite RS, ce qui permet d'assurer le remplacement de la séquence  $S_i$  par le code spécifique  $C_i$ . Le module de 25 compactage proprement dit COMP délivre ainsi un programme de type code objet intermédiaire compacté, noté sur la figure 6a, COD-OBJ-INT-COMP. Bien entendu, le système de compactage représenté en figure 6a permet l'application du processus de compactage précédemment décrit à l'ensemble de toutes les 30 séquences d'instructions directement exécutables  $S_i$  considérées.

En ce qui concerne le module d'allocation AL, dans un mode de réalisation non limitatif, on indique que celui-ci, ainsi que représenté en figure 6b, peut comporter un module de calcul en nombre d'octets de la longueur  $n_i$  de la 5 séquence d'instructions  $S_i$ , ce module étant désigné par  $AL_1$  sur la figure 6b. Il peut comporter également un module de calcul, noté  $AL_2$ , du produit de cette longueur  $n_i$  et du nombre d'occurrences  $N_i$  de cette séquence  $S_i$  d'instructions standard. Ce produit, noté  $P_i$ , est représentatif du gain de 10 compactage pour la séquence d'instructions directement exécutables  $S_i$  considérée.

En outre, le module d'allocation AL peut comprendre un module de comparaison, noté  $AL_3$ , de ce produit  $P_i$  à une valeur de seuil, notée  $S$ , déterminée. La valeur du seuil  $S$  15 peut être déterminée expérimentalement. Elle peut également être établie à partir de cette valeur expérimentale pour correspondre, pour un programme de type code objet intermédiaire de longueur donnée, à un pourcentage donné de cette longueur.

20 Sur réponse négative au test de comparaison effectué par le module  $AL_3$ , le rang  $i$  de chaque séquence d'instructions directement exécutables  $S_i$  est incrémenté d'une unité et la nouvelle valeur de  $i$  est renvoyée au module d'analyse A, d'une part, et au module de comptage C, 25 d'autre part.

Sur réponse positive au test de comparaison réalisé par le module  $AL_3$ , un module  $AL_4$  permet d'établir un code spécifique  $C_i$  correspondant et, enfin, un module  $AL_5$  permet d'assurer en correspondance biunivoque l'écriture du code 30 spécifique  $C_i$  et de la séquence  $S_i$  considérée d'instructions directement exécutables pour constituer l'instruction spécifique  $IS_i$ .

En ce qui concerne le module AL<sub>4</sub>, on indique que celui-ci peut être réalisé par un module logiciel de comptage permettant, à partir d'une valeur de départ, par exemple la valeur 106 précédemment mentionnée dans la 5 description, d'allouer une valeur correspondante pour la séquence d'instructions S<sub>i</sub> considérée. Chaque instruction spécifique IS<sub>i</sub> peut alors être écrite sous forme d'une liste correspondante.

Des essais en temps réel de compactage de programmes 10 ou applications contenus dans des cartes à microprocesseur commercialisées par la société BULL CP8 en France ont montré un gain de compactage supérieur à 33%, ce qui permet en fait, lors d'une application du processus de compactage à un nombre égal à trois applications pour un objet portatif 15 mobile, de gagner sensiblement une application supplémentaire pour ce type d'objet.

Un tel gain de compactage a été obtenu dans des conditions sensiblement normales d'utilisation par l'utilisateur, alors que le ralentissement introduit par 20 l'appel de macro-instructions, ce ralentissement étant inhérent à l'appel successif en lecture au niveau du tableau TAB-B-PRO et du fichier MEM-SEQ, n'excède pas sensiblement 10% du temps d'exécution en l'absence de macro-instructions.

REVENDICATIONS

1. Procédé de compactage d'un programme intermédiaire consistant en une suite d'instructions standard, utilisé dans un système embarqué, ce système 5 embarqué étant doté d'une mémoire et d'un interpréteur de langage du programme intermédiaire en instructions d'un code objet directement exécutables par un microprocesseur, procédé suivant lequel :
  - a) on recherche dans le programme intermédiaire des 10 séquences identiques d'instructions standard successives ;
  - b) on soumet les séquences identiques d'instructions successives à un test de comparaison de supériorité d'une fonction d'au moins le nombre d'occurrences de ces 15 séquences dans ledit programme intermédiaire à une valeur de référence et, sur réponse positive audit test, pour chaque séquence identique d'instructions standard successives satisfaisant à ladite étape de test,
  - c) on engendre une instruction spécifique par définition 20 d'un code opératoire spécifique et association à ce code opératoire spécifique de ladite séquence d'instructions standard successives ayant satisfait audit test ;
  - d) on remplace dans ledit programme intermédiaire chaque 25 occurrence de chaque séquence d'instructions successives par ledit code opératoire spécifique qui lui est associé pour obtenir un programme intermédiaire compacté, consistant en une succession d'instructions standard et de codes opératoires spécifiques, et
  - e) on mémorise dans ladite mémoire une table d'exécution 30 permettant la mise en correspondance biunivoque entre chaque code opératoire spécifique introduit et la séquence d'instructions successives associée à ce

dernier, ce qui permet d'optimiser l'espace mémoire occupé par ledit programme intermédiaire compacté par mémorisation dans ladite mémoire d'une seule occurrence desdites séquences identiques d'instructions successives.

5 2. Procédé selon la revendication 1, caractérisé en ce que ladite fonction est en outre fonction de la taille de chaque séquence identique d'instructions successives.

10 3. Procédé selon la revendication 1, caractérisé en ce que pour la mise en œuvre d'un compactage d'une pluralité de programmes intermédiaires, ledit procédé consiste en outre :

15 - à mémoriser la table d'exécution relative à au moins un programme intermédiaire compacté, et pour tout programme intermédiaire supplémentaire soumis à un processus de compactage ;

- à lire ladite table d'exécution mémorisée, et

- à effectuer le compactage de tout programme supplémentaire, compte tenu des instructions et codes spécifiques mémorisés dans cette table d'exécution.

20 4. Procédé d'exécution d'un programme intermédiaire compacté obtenu par la mise en œuvre du procédé de compactage selon la revendication 1, et consistant en une succession d'instructions standard et de codes opératoires spécifiques mémorisés dans la mémoire d'un système embarqué, 25 caractérisé en ce qu'il consiste :

- à reconnaître dans ladite mémoire l'existence d'une table d'exécution mémorisée comportant au moins une séquence d'instructions successives associée à un code opératoire spécifique en correspondance biunivoque ;

30 - à appeler, par l'intermédiaire de l'interpréteur, une commande de lecture des instructions standard ou codes opératoires spécifiques successifs du programme

intermédiaire compacté et, en présence d'un code opératoire spécifique :

- appeler par instruction de lecture dans la mémoire ladite séquence d'instructions successives associée audit code opératoire spécifique et, en présence d'une instruction standard,
- appeler par instruction de lecture l'exécution de cette instruction.

5 5. Procédé selon la revendication 4, caractérisé en ce que lorsqu'une séquence d'instructions successives associée à un code opératoire spécifique est appelée, la valeur courante d'un compteur de programme est incrémentée dans une pile associée aux codes opératoires spécifiques, et un pointeur de programme pointe vers la première instruction 10 de ladite séquence d'instructions spécifique, puis, sur exécution d'une instruction de fin de séquence d'instructions spécifiques, ledit compteur de programme est décrémenté, et l'exécution se poursuit à partir de 15 l'instruction ou du code opératoire spécifique suivant.

20 6. Procédé selon la revendication 5, caractérisé en ce que la pile associée aux codes opératoires spécifiques et la pile associée aux instructions standard sont constituées par une pile unique.

25 7. Système embarqué multi-applications comprenant des ressources de calcul, une mémoire et un interpréteur de langage d'un programme intermédiaire en instructions directement exécutables par ces ressources de calcul, caractérisé en ce que ledit système embarqué multi-applications comporte au moins, outre un tableau des codes 30 standard constitutifs dudit programme intermédiaire mémorisé au niveau dudit interpréteur :

- au moins un programme intermédiaire compacté, constitutif d'une application et consistant en une suite de codes d'instructions spécifiques et de codes d'instructions standard, lesdits codes d'instructions spécifiques 5 correspondant à des séquences d'instructions standard successives ;

- une table d'exécution permettant la mise en correspondance biunivoque entre code opératoire spécifique et la séquence d'instructions standard successives associée 10 à ce dernier, ledit au moins un programme intermédiaire compacté et ladite table d'exécution étant mémorisés dans ladite mémoire, ce qui permet d'optimiser l'espace mémoire occupé par ledit programme intermédiaire compacté par mémorisation dans ladite mémoire programmable d'une seule 15 occurrence desdites séquences identiques d'instructions successives.

8. Système embarqué selon la revendication 7, caractérisé en ce que ladite table d'exécution comprend au moins :

20 - un fichier des séquences successives correspondant aux instructions spécifiques ;

- un tableau des codes d'instructions spécifiques et des adresses d'implantation de ces instructions spécifiques dans la table des séquences successives.

25 9. Système embarqué selon la revendication 8, caractérisé en ce que ledit fichier des séquences successives correspondant aux instructions spécifiques et ledit tableau des codes d'instructions spécifiques sont mémorisés en mémoire programmable dudit système embarqué.

30 10. Système de compactage d'un programme intermédiaire, ce programme intermédiaire consistant en une série d'instructions standard exécutables par une unité

cible, caractérisé en ce que ledit système comprend au moins :

- des moyens d'analyse de toutes les instructions standard exécutables permettant par lecture dudit programme intermédiaire de discriminer et établir une liste de toutes les séquences d'instructions standard exécutables contenues dans ce programme intermédiaire ;
- des moyens de comptage du nombre d'occurrences, dans ce programme intermédiaire, de chacune des séquences d'instructions standard exécutables membre de cette liste ;
- des moyens d'allocation à au moins une séquence d'instructions standard exécutables d'un code spécifique associé à cette séquence d'instructions standard exécutables pour engendrer une instruction spécifique ;
- des moyens de remplacement dans le programme de chaque occurrence de cette séquence d'instructions standard exécutables par le code spécifique associé à cette séquence d'instructions standard exécutables, représentatif de ladite instruction spécifique, ce qui permet d'engendrer un programme compacté, comprenant une succession d'instructions standard exécutables et d'instructions spécifiques.

11. Système selon la revendication 10, caractérisé en ce que lesdits moyens d'allocation à au moins une séquence d'instructions standard exécutables d'un code spécifique associé à cette séquence d'instructions standard exécutables pour engendrer une instruction spécifique comportent au moins :

- des moyens de calcul de la valeur d'une fonction d'au moins la longueur et du nombre d'occurrences de cette séquence d'instructions standard exécutables, ladite fonction étant représentative du gain de compactage pour cette séquence d'instructions standard exécutables ;

- des moyens de comparaison de la valeur de cette fonction à une valeur de seuil, et, sur réponse positive à ladite comparaison,
- des moyens d'écriture dans un fichier en correspondance biunivoque d'un code spécifique et de cette séquence d'instructions standard exécutables pour constituer ladite instruction spécifique.



FIG.1a.  
(ART ANTERIEUR)



FIG.1b.  
(ART ANTERIEUR)



FIG.1c.  
(ART ANTERIEUR)



FIG.2a.



FIG.2c.



|     |                    |  |
|-----|--------------------|--|
| 120 | SÉQUENCE 4 = $s_4$ |  |
| 110 | SÉQUENCE 3 = $s_3$ |  |
| 107 | SÉQUENCE 2 = $s_2$ |  |
| 106 | SÉQUENCE 1 = $s_1$ |  |

FIG. 3a.



FIG.3b .

FIG.4







FIG. 6a.

FIG.6b.



RAPPORT DE RECHERCHE  
PRELIMINAIREétabli sur la base des dernières revendications  
déposées avant le commencement de la recherche

2785695

N° d'enregistrement  
nationalFA 569466  
FR 9814012

## DOCUMENTS CONSIDERES COMME PERTINENTS

| Catégorie | Citation du document avec indication, en cas de besoin, des parties pertinentes                                                                                                                                                                                                                                                                        | Revendications concernées de la demande examinée |
|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
| Y         | VAHID F: "Procedure exlining: a transformation for improved system and behavioral synthesis"<br>PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON SYSTEM SYNTHESIS, 13 septembre 1995 (1995-09-13), pages 508-513 89, XP002098348                                                                                                                          | 1,2,4,5,<br>7,10,11                              |
| A         | * abrégé *<br>* page 508, colonne de droite, alinéa 3 - alinéa 4 *<br>* page 509, colonne de gauche, alinéa 3 *<br>* page 509, colonne de droite, alinéa 3 *<br>---                                                                                                                                                                                    | 8,9                                              |
| Y         | DEBAERE E H: "A language coprocessor for the interpretation of threaded code"<br>MICROPROCESSING & MICROPROGRAMMING, vol. 21, no. 1-5, août 1988 (1988-08), pages 593-602, XP002114520 Amsterdam, The Netherlands                                                                                                                                      | 1,2,4,5,<br>7,10,11                              |
| A         | * abrégé *<br>* page 594, colonne de gauche, alinéa 1 - alinéa 3 *<br>* page 595, colonne de gauche, alinéa 3 *<br>* page 595, colonne de droite, alinéa 3 *<br>* page 596; tableau 3 *<br>* page 598, colonne de droite, alinéa 4 *<br>---                                                                                                            | 8,9                                              |
| A         | ACCOMAZZO E ET AL: "INTEGRATING INTERMEDIATE CODE OPTIMIZATION WITH RETARGETABLE CODE GENERATION"<br>MICROPROCESSING AND MICROPROGRAMMING, vol. 30, no. 1 / 05, 1 août 1990 (1990-08-01), pages 475-481, XP000141684<br>ISSN: 0165-6074<br>* abrégé *<br>* page 475, colonne de gauche, alinéa 2 *<br>* page 475, colonne de droite, alinéa 3 *<br>--- | 1                                                |
| 2         | Date d'achèvement de la recherche<br>7 septembre 1999                                                                                                                                                                                                                                                                                                  | Examinateur<br>Beltrán-Escavy, J                 |

EPO FORM 1503 03/82 (PO4C13)

## CATEGORIE DES DOCUMENTS CITES

X : particulièrement pertinent à lui seul  
Y : particulièrement pertinent en combinaison avec un autre document de la même catégorie  
A : pertinent à l'encontre d'au moins une revendication ou arrière-plan technologique général  
O : divulgation non écrite  
P : document intercalaire

T : théorie ou principe à la base de l'invention  
E : document de brevet bénéficiant d'une date antérieure à la date de dépôt et qui n'a été publié qu'à cette date de dépôt ou qu'à une date postérieure.  
D : cité dans la demande  
L : cité pour d'autres raisons  
& : membre de la même famille, document correspondant

REPUBLIQUE FRANÇAISE

INSTITUT NATIONAL  
de la  
PROPRIETE INDUSTRIELLERAPPORT DE RECHERCHE  
PRELIMINAIREétabli sur la base des dernières revendications  
déposées avant le commencement de la recherche

2785695

N° d'enregistrement  
nationalFA 569466  
FR 9814012

## DOCUMENTS CONSIDERES COMME PERTINENTS

| Catégorie                                                                                                                                                                                                                                                                                          | Citation du document avec indication, en cas de besoin, des parties pertinentes                                                                                                                                                                                                                                     | Revendications concernées de la demande examinée                                                                                                                                                                                                                                                                                          |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A                                                                                                                                                                                                                                                                                                  | BIGET P ET AL: "How smart cards can benefit from object-oriented technologies"<br>FUTURE GENERATIONS COMPUTER SYSTEMS,<br>vol. 13, no. 1,<br>1 juillet 1997 (1997-07-01), page 75-90<br>XP004081711<br>ISSN: 0167-739X<br>* page 83, colonne de droite, alinéa 2 -<br>page 84, colonne de gauche, alinéa 2 *<br>--- | 1                                                                                                                                                                                                                                                                                                                                         |
| A                                                                                                                                                                                                                                                                                                  | FR 2 757 970 A (GEMPLUS CARD INT)<br>3 juillet 1998 (1998-07-03)<br>* le document en entier *<br>-----                                                                                                                                                                                                              | 1                                                                                                                                                                                                                                                                                                                                         |
| 2                                                                                                                                                                                                                                                                                                  |                                                                                                                                                                                                                                                                                                                     | DOMAINES TECHNIQUES<br>RECHERCHES (Int.CL.6)                                                                                                                                                                                                                                                                                              |
|                                                                                                                                                                                                                                                                                                    | Date d'achèvement de la recherche<br>7 septembre 1999                                                                                                                                                                                                                                                               | Examinateur<br>Beltrán-Escavy, J                                                                                                                                                                                                                                                                                                          |
| CATEGORIE DES DOCUMENTS CITES                                                                                                                                                                                                                                                                      |                                                                                                                                                                                                                                                                                                                     | T : théorie ou principe à la base de l'invention<br>E : document de brevet bénéficiant d'une date antérieure<br>à la date de dépôt et qui n'a été publié qu'à cette date<br>de dépôt ou qu'à une date postérieure.<br>D : cité dans la demande<br>L : cité pour d'autres raisons<br>& : membre de la même famille, document correspondant |
| X : particulièrement pertinent à lui seul<br>Y : particulièrement pertinent en combinaison avec un autre document de la même catégorie<br>A : pertinent à l'encontre d'au moins une revendication ou arrière-plan technologique général<br>O : divulgation non-écrite<br>P : document intercalaire |                                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                                                                                                                                                                                           |

PI - VII + 1-60 (67)

XP-001199580

**Compacting Object Code via Parameterized Procedural Abstraction**

by

Michael Joseph Zastre  
B.Sc., Simon Fraser University, 1993

**A Thesis Submitted in Partial Fulfillment of the  
Requirements for the Degree of**

**MASTER OF SCIENCE**

**in the Department of Computer Science**

**We accept this thesis as conforming  
to the required standard**

---

**Dr. R. N. Horspool, Supervisor (Dept. of Computer Science)**

---

**Dr. D. M. Hoffman (Dept. of Computer Science)**

---

**Dr. G. C. Shoja (Dept. of Computer Science)**

---

**Dr. R. Vahldieck (Dept. of Electrical and Computer Engineering)**

**© Michael Joseph Zastre, 1995  
University of Victoria**

**All rights reserved. This thesis may not be reproduced in whole or in part,  
by photocopying or other means, without the permission of the author.**

Supervisor: Dr. R. N. Horspool

## Abstract

Current *compiler optimization* research focuses on program transformations that increase code speed. Decreasing object file size is sometimes a goal since less code often means faster code. In this thesis we examine *parameterized procedural abstraction*. This is an extension of an optimization whose sole purpose is to reduce code size. Previously published implementations of *procedural abstraction* have produced space savings where instruction sequences are exact matches (i.e. instruction operations *and* operands are the same between code sequences). Researchers felt that parameterizing differences had too high a cost, but no investigations have been made into the potential space savings from exploiting inexact matches. We show that permanent space savings (*compaction*) are possible when (1) covering matches with more than one procedure and (2) carefully choosing which inexact match instances are covered by each procedure. Our heuristic algorithms choose a mix of parameterized and unparameterized procedures that yield substantially better space savings in comparison to unparameterized procedures only.

### Examiners:

---

Dr. R. N. Horspool, Supervisor (Dept. of Computer Science)

---

Dr. D. M. Hoffman, Departmental Member (Dept. of Computer Science)

---

Dr. G. C. Shoja, Departmental Member (Dept. of Computer Science)

---

Dr. R. Vahldieck, External Examiner (Dept. of Electrical and Computer Engineering)

## Table of Contents

|                                                            |     |
|------------------------------------------------------------|-----|
| <b>Abstract</b> .....                                      | ii  |
| <b>Table of Contents</b> .....                             | iii |
| <b>List of Tables</b> .....                                | iv  |
| <b>List of Figures</b> .....                               | v   |
| <b>Acknowledgments</b> .....                               | vi  |
| <b>Dedication</b> .....                                    | vii |
| <br>                                                       |     |
| <b>1: Introduction</b> .....                               | 1   |
| <br>                                                       |     |
| <b>2: Previous Work</b> .....                              | 3   |
| 2.1: Compiler Optimizations Leading to Space Savings ..... | 3   |
| 2.2: Exploiting Similarity of Code .....                   | 10  |
| 2.3: Matching Code Representation to Code Properties ..... | 11  |
| 2.4: Compilation to Compact Code .....                     | 14  |
| 2.5: Analyzing and Compressing Assembly Code .....         | 21  |
| 2.6: Looking Forward .....                                 | 26  |
| <br>                                                       |     |
| <b>3: Compaction and its Problems</b> .....                | 28  |
| 3.1: The Problem .....                                     | 29  |
| 3.2: Preliminaries to Procedure Searching .....            | 31  |
| 3.3: Procedure Searching .....                             | 35  |
| 3.4: Quick Comparison of Different Methods .....           | 37  |
| 3.5: Summary .....                                         | 39  |
| <br>                                                       |     |
| <b>4: Searching for Subroutines</b> .....                  | 41  |
| 4.1: Definitions .....                                     | 41  |
| 4.2: Visualizing Operands .....                            | 43  |
| 4.3: $\Omega$ -array Partitions .....                      | 45  |
| 4.4: Heuristics .....                                      | 51  |
| 4.5: Further transformations that could be pursued .....   | 60  |
| 4.6: Summary .....                                         | 61  |
| <br>                                                       |     |
| <b>5: Summary and Conclusions</b> .....                    | 63  |
| 5.1: Summary .....                                         | 63  |
| 5.2: Future Work .....                                     | 64  |
| 5.3: Conclusion .....                                      | 65  |
| <b>References</b> .....                                    | 67  |
| <br>                                                       |     |
| <b>Glossary</b> .....                                      | 70  |

## List of Tables

|           |                                                 |    |
|-----------|-------------------------------------------------|----|
| Table 2.1 | Common subexpression elimination — example      | 4  |
| Table 2.2 | Peephole optimization — examples                | 6  |
| Table 2.3 | Code Hoisting — example                         | 8  |
| Table 2.4 | Copy propagation — example                      | 8  |
| Table 2.5 | Closed and open fragments — example             | 25 |
| Table 3.1 | Substring table from suffix tree                | 36 |
| Table 3.2 | Possibilities during candidate examination      | 39 |
| Table 4.1 | Vertical components for candidate in Figure 4.3 | 48 |
| Table 4.2 | Horizontal partitions for example in Figure 4.2 | 49 |
| Table 4.3 | Possible number of partitionings                | 52 |
| Table 4.4 | FMW compaction — % savings                      | 55 |
| Table 4.5 | Heuristic 1 compaction — % savings              | 56 |
| Table 4.6 | Heuristic 2 compaction — % savings              | 57 |
| Table 4.7 | Heuristic 3 compaction — % savings              | 59 |
| Table 4.8 | Best of heuristic — % savings                   | 59 |
| Table 4.9 | “Best of heuristic” improvement over FMW        | 60 |

## List of Figures

|            |                                                                      |    |
|------------|----------------------------------------------------------------------|----|
| Figure 2.1 | Conditional coding — encoder/decoder design                          | 13 |
| Figure 2.2 | Position of compressor in system — High-level subroutine recognition | 15 |
| Figure 2.3 | Algorithm for High-level subroutine recognition                      | 16 |
| Figure 2.4 | Position of compressor in system — Low-level subroutine recognition  | 18 |
| Figure 2.5 | Schematic of mechanism for tailored interpreter.                     | 22 |
| Figure 2.6 | Suffix tree example — “ababaabb\$”                                   | 24 |
| Figure 3.1 | Compaction example                                                   | 28 |
| Figure 3.2 | Candidate instances — one parameter required                         | 29 |
| Figure 3.3 | Candidate instances — turned into a subroutine                       | 30 |
| Figure 3.4 | Instances supporting several procedure mappings                      | 31 |
| Figure 3.5 | Suffix tree for example — prefix traversal                           | 37 |
| Figure 4.1 | One parameter example — $\Omega$ -array form                         | 43 |
| Figure 4.2 | Complicated partitioning — $\Omega$ -array                           | 45 |
| Figure 4.3 | Tieing example — $\Omega$ -array                                     | 46 |
| Figure 4.4 | Algorithm 2 — Finding Vertical Partition Components                  | 47 |
| Figure 4.5 | Finding a valid tieing partition                                     | 50 |
| Figure 4.6 | Algorithm 3 — Find tieing valid for partition instances              | 51 |
| Figure 4.7 | Algorithm 4 — partitioning heuristic 1                               | 55 |
| Figure 4.8 | Algorithm 5 — partitioning heuristic 2                               | 57 |
| Figure 4.9 | Algorithm 6 — partitioning heuristic 3                               | 58 |

## Acknowledgments

Dr. R. N. Horspool is not boilerplated onto this page, and his supervision has helped me learn more of both the craft of research and the communication of it's results. He has constantly preached that technical writing is a skill which must be practiced and perfected, and his suggestions throughout the writing of this thesis have been lessons in themselves. I hope that the reader benefits from my attempts to digest them. I also thank Dr. R. Harrop and Dr. S. Atkins at Simon Fraser University for their continued encouragement and sage advice.

Great hosannas are aimed at whatever force brought Jim Uhl across my path. I thank Jim for always being available for a chat or brainstorming session. On several occasions, his mastery of systems programming explained something that puzzled or frustrated me. He has also explained abstruse concepts to me with great clarity, and this was a great comfort when trying to decipher a paper or book.

Special thanks go to Jan Vitek for his suggestions and enthusiasm, especially since it came at a time when I sorely needed it. Philippe McLean happened to be enrolled in a combinatorics course at the same time I was trying to figure out a combinatorics problem, and I thank him for pointing out the solution in his textbook.

Finally, I thank my wife Susanne. All the books on graduate life maintain that being married enhances one's chance of completing a post-graduate degree. Some people may feel this requirement is a little drastic, but I literally could not have completed this degree without her.

*To Joel and Darrel*

# 1 Introduction

Computer programs that are powerful but small have always been desirable. Research has focussed on gains in speed as internal storage constraints recede into the background.

Increasing chip densities and falling prices are encouraging larger binaries. However, this satisfies no one. Computer programmers and users often find that there is not enough memory. This is partly due to the quality of code produced by modern compilers, but some observers note that the current memory crunch began with the demand for feature-laden software systems [29]. In the end, smaller binaries need smaller system memories. All storage savings are a benefit if they do not result in slower computer systems.

Internal storage is not the only reason to seek smaller code. Distributed systems depend more and more on code coming in from servers via a network. Smaller binaries decrease the time spent in communication. Stand-alone systems need less time to load from an internal disk. This is already done by network hardware that implements compression and decompression through on-board hardware. Most personal computers now use software-based compression to obtain more room on hard drives, but at a cost in performance. Can a program be written to exploit instruction code redundancy with an eye to preserving performance, and rewritten to take less room? The approach is attractive as it eliminates the decompression stage. The compressor output is an executable binary, not a smaller file that would be unreadable without the appropriate decompressor.

We have implied that the code generators in today's compilers focus on increasing speed because memory is not the main issue. This is slightly unfair as some compiler optimizations for speed also result in less code. However, the general observation is that there is a trade-off between space and speed optimizations. Even if attention is paid to space, today's compilers generate code procedure by procedure as interprocedural data flow analysis is still expensive. The code within procedures is of a high quality, but achieving further space optimizations becomes expensive. Systems with multiple modules add other complications during analysis.

Research has shown that compilers generate code sequences that appear many times throughout the final object file [8][21][9]. We can reduce the size of the object file by keeping only one copy of the sequence, and then replacing all other copies with a procedure call. This technique is called *procedural abstraction*, and is one method for reducing the size of binaries. Throughout the thesis, we use the term *compaction* to describe the space savings achieved through procedural abstraction because the smaller binary is immediately executable, thereby distinguishing the results from *compressed* binaries produced by such UNIX utilities as `compress` or `gzip`; the compressed binaries must be *decompressed* before execution.

We expect to see many more close instruction sequence matches than exact instruction sequence matches. By parameterizing the differences, we can replace all copies with one procedure. Most compaction schemes in the literature acknowledge the possibility of using inexact matches by parameterizing the replacement procedures, but because of the cost of extra storage instructions to copy data into and out of parameters, parameters are ruled out. They therefore constrain instruction matches to be exact. However, if instruction sequences are grouped together such that each group is covered by a procedure with the smallest number of parameters, then greater space savings than parameterless schemes are expected. This raises two questions: how do we group instruction matches into procedures without performing an exhaustive search of the solution space? and how much better is the space savings?

Chapter 2 outlines the previous work on reducing the size of executable code. Chapter 3 explains the compaction and parameterization problem in greater detail. Chapter 4 is a treatment of procedure parameterization, including heuristics for assigning similar sections of code to procedures and selecting procedure parameters with an eye towards good space savings. Chapter 5 summarizes our results and suggests future research directions.

## 2 Previous Work

As mentioned in the introduction, compiler optimizations for space have received less attention in recent years than those for speed. There are far fewer published papers for the former. Some of the compiler optimizations for speed also save space, and some of these are reviewed. Other techniques use the results of information theory by choosing instruction and data representations that result in the smallest executable images given the statistical properties of a set of target programs; one such approach will be covered. However, assuming that we do not want to add compiler optimizations that transform and rearrange the internal representation, and cannot entertain the complexity and expense of adding hardware that decodes custom instruction sets, then space savings gains are achieved through procedural abstraction. Sections of executable code that match each other can be replaced by subroutine calls, replacing the many sections with one subroutine body.

### 2.1 Compiler Optimizations Leading to Space Savings

A compiler's back end converts front-end output (the intermediate representation, or IR) into executable code. Early compilers emitted machine code that was far less efficient than hand coded equivalents. Successive compiler writers employed heuristic solutions that improved code quality. These were applied at all compiler stages. Some transformed the source statements and others the machine code, but most concentrated on the IR. In the late 1960s and early 1970s, IR transformations called *optimizations* began to attract more attention, were seriously studied, and, in some cases, were formalized. Most of these optimizations aim to increase the speed of code and, since a guiding intuition is that "the less code there is, the less time is spent executing," space savings are often an extra benefit. Compiler optimizations for modern high performance architectures show that the intuition is not always true, especially in many loop optimizations for code intended for parallel architectures. However, many of the optimizations performed within current compilers continue simultaneously to increase code speed and to reduce code size.

### 2.1.1 Common Subexpression Elimination (CSE)

Ever since it was first published, this optimization has remained one of the most profitable [7]. Source code expressions which appear in different parts of a program or procedure, and whose argument values do not change between appearances, can be replaced by a temporary variable. In the example from Table 2.1, the expression “ $x * y$ ” appears several times in the C fragment. At each appearance of the expression, the values of  $x$  and  $y$  have not changed. Recomputing the expression is unnecessary. It is also a subexpression of “ $x * y * z$ ”. This larger expression, which appears again in the last statement, cannot be replaced by a temporary variable. The last argument,  $z$ , changes value, and the larger expression must be recomputed at “ $f$ ”. At no point are the values of  $x$  and  $y$  changed, so the temporary  $T1$  may be used throughout the example. A machine code version of the transformation is shown. Two machine instructions are saved, or 20% of the unoptimized code.

Table 2.1 Common subexpression elimination — example

|               | Before                                                                                                                                             | After                                                                                                          |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| pseudo source | <pre> c = x * y; d = x * y * z; z = 2; e = x * y; f = x * y * z; </pre>                                                                            | <pre> T1 = x * y; c = T1; d = T1 * z; z = 2; e = T1; f = T1 * z; </pre>                                        |
| assembler     | <pre> ld r1, (x) ld r2, (y) mul r3, r2, r1 ld r4, (z) mul r5, r2, r1 mul r5, r5, r4 ld r4, #2 mul r6, r2, r1 mul r7, r2, r1 mul r7, r7, r4; </pre> | <pre> ld r1, (x) ld r2, (y) mul r3, r2, r1 ld r4, (z) mul r5, r3, r4 ld r4, #2 ld r6, r3 mul r7, r3, r4 </pre> |

Although space savings through CSE can be substantial, the optimization must still be carefully applied. For example, some numerical algorithms depend on the rounding behavior of a floating point representation. A small difference between the representation expected by the algorithm and that forced by the compiler may cause an algorithm to

produce incorrect results. This occurs if the algorithm no longer terminates because of the change, or if the change results in an algorithm with an error bound exceeding algorithm specifications [10]. If the rounding method were to be changed after the calculation of  $d$  in the example above, the expression assigned to  $e$  would need to be recalculated. Replacing the second calculation of " $x * y$ " by a load from the temporary variable would then be a modification of the algorithm. This is unacceptable, as any compaction of an executable must avoid changing the meaning of that program.

Another case where savings may be lost occurs if there are significant demands on registers (*register pressure*). As registers are a precious high-speed resource, their allocation is important and difficult. If a CSE transformation stores temporary results in a register, then the number available decreases, leading to situations where the register requirements exceed supply. Spill code is the result, increasing the size of the program [3]. Therefore, if not used wisely, CSE may decrease speed and increase space requirements.

### 2.1.2 Peephole Optimization

This transformation examines and modifies the machine code generated by the compiler [20]. Despite the excellence of code generated for separate functions and procedures, looking at all of the code at one time may expose redundancies and inefficiencies. These usually occur at locations where the code for one procedure ends and another begins. Redundancies are removed by replacing them with equivalent, shorter instruction sequences. In essence, many different optimizations are applied, such as identification of redundancies, "unreachable code", "flow of control" optimizations, but in a very local region and at a low level [1].

A small window (the peephole) passes over the code, limiting improvements to the code visible within the window. A few examples are shown in Table 2.2. Example 1 replaces a load-store pair with a single load (no need to store a value already known to be in memory). Example 2 removes the unconditional branch: the destination is the next instruction. (This code was probably generated for the last option of a "case" statement.)

The final example changes the conditional branch/unconditional branch pair into an equivalent instruction. Static peephole optimizations nearly always result in space savings.

Table 2.2 Peephole optimization — examples

|   | Before                  | After           |
|---|-------------------------|-----------------|
| 1 | ld X, r5<br>st r5, X    | ld X, r5        |
| 2 | br A<br>A: ...          | A: ...          |
| 3 | beq A<br>br B<br>A: ... | bne B<br>A: ... |

### 2.1.3 Unreachable Code Elimination

Programmers often insert debugging statements into code, surrounding them with conditional statements. The conditional expression tests a programmer-specified compile-time constant, and the code is or is not executed at run-time depending on the value of the constant. Some program generators may emit statements with certain parts enabled or disabled, and this usually depends on programmer specified compile-time constants. If the language preprocessor does not support the removal of statements from the stream sent to the compiler (e.g. "#ifdef ... #endif" in C), then the final program will contain sections that will never be executed. This code is called *unreachable*. Eliminating it from the program will not change the output, and may save considerable space [2].

### 2.1.4 Useless Code Elimination

In the 1970s, data flow analysis methods were formalized and used in compilers. Relationships between the values of variables are found through these methods. One example is *live variable analysis* [12]. A variable is said to be live at a specific point in the program if its value is used after that point at least once on some execution path. If the variable on the left hand side of an assignment is not live after that operation, then the computations on the right hand side may be eliminated, including the assignment. Since

there is no longer a need to compute the expression or store its value, space can be saved by omitting both the operation and the assignment from the final executable [1]. Unreachable and useless code is sometimes referred to as *dead code*.

### 2.1.5 Code Hoisting

A generalized form of code motion, it moves *very busy expressions* to a point where space is saved, but not necessarily time. Intuitively, an expression is very busy at a specific point in a program's execution if (a) it will be evaluated on all possible execution paths, and (b) all evaluations occur before any change in the value of an expression argument. That is, there is at least one evaluation of the expression, no matter what control flow path is taken after the very busy point. Instructions required to compute the expressions are removed from each of the evaluation locations, and are replaced there by an assignment from a temporary variable. At the very busy point, one copy of the code for the expression is inserted, and the result is assigned to the temporary variable [1].

In the pseudocode example from Table 2.3, the exponentiation expression “ $c \star\star d$ ” is very busy at a point just before the “*if*” statement. Every possible flow of control from this point to the bottom of the fragment will evaluate the expression. As exponentiation may require many instructions, a space savings occurs when the code is hoisted to the very busy point. Line counts are deceptive in this example: code hoisting is sometimes mistaken for a general form of loop-invariant code motion. However, what is important is that the body of code needed to calculate “ $c \star\star d$ ” appears four times in the code before it is transformed, and only once after.

The results are similar to CSE, but the motivation for code hoisting is space saving, not speed improvement.

### 2.1.6 Copy Propagation

Copy propagation eliminates redundant names for the same value (see example in Table 2.4). In turn, fewer variables may be needed. Although this appears to be the substitution of many variables by one variable, which by itself does not reduce the number of instructions, copy propagation directly enables other space saving optimizations. Register

Table 2.3 Code Hoisting — example

| before                                                                                                                                                                                                                                                                                                                       | after                                                                                                                                                                                                                                                                                                                     |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre> if ( expr1 ) {     switch( expr2 ) {         case 1: a = c ** d;                    break;         case 2: a = c ** d;                    a = a ** 2;                    break;         default: b = c ** d;                    break;     }     a = b * 2; } else {     a = c ** 4;     f = c ** d; } print a; </pre> | <pre> T1 = c ** d; if ( expr1 ) {     switch( expr2 ) {         case 1: a = T1;                    break;         case 2: a = T1;                    a = a ** 2;                    break;         default: b = T1;                    break;     }     a = b * 2; } else {     a = c ** 4;     f = T1; } print a; </pre> |

pressure is lightened as fewer values need to be kept in registers, reducing the contribution of register spill code to the executable's size [3].

Table 2.4 Copy propagation — example

| before                                                                    | after                                                       |
|---------------------------------------------------------------------------|-------------------------------------------------------------|
| <pre> a = x * x; y = a; z = foo( y - 4 ); z = a; b = foo( z - 9 ); </pre> | <pre> a = x * x; z = foo( a - 4 ); b = foo( a - 9 ); </pre> |

### 2.1.7 Leaf-Procedure Optimization

If a procedure does not call any other routines, it is said to be a leaf-procedure because these appear as leaves in the call graph. Several observations lead to space optimizations [3]. For instance, the register which holds the return address need not be saved or restored in anticipation of another procedure call, thereby saving two instructions. Allocating stack

space is unnecessary if no local variables are used, eliminating the code needed to set up the stack frame.

#### 2.1.8 Parameter Promotion

In some languages such as FORTRAN, all procedure parameters use the “call-by-reference” mechanism, and their values are read from memory by the procedure.

Computations need these values, so they are typically held in registers. If the procedure calls other procedures, and passes in some of the same parameters, the memory reads are duplicated. Register demands will increase because the compiler cannot determine that the data is already stored in registers. Why read the same memory locations twice? Why store the newly loaded values back into memory because of register spillage? Instead we can use the register itself to pass in the data to procedures receiving the parameter. Changes to the parameters can be applied to the registers, and saved back to memory when exiting the procedure level that first received the parameter.

Readers interested in detailed examples of *leaf procedure* and *parameter promotion* optimizations are encouraged to refer to “Procedure Call Transformations” section in the survey of optimization techniques for high-performance computers by Bacon, Graham and Sharp [3].

#### 2.1.9 Span-Dependent Instruction Optimization

Unlike RISC opcodes, CISC opcodes vary in size. For instance, branch instructions with nearby destination addresses may need only one or two bytes to indicate a destination address. Far destinations require more than two bytes to specify the address, extending the length of branch instructions to three or four bytes. If code blocks are carefully placed, then the contribution of the branch instruction sizes to execution size can be minimized. Ideally, all branches would require only two bytes, but as this is not always possible, an algorithm exists that maximizes the short branches [27]. The larger the number of short branches, the greater the space savings.

## 2.2 Exploiting Similarity of Code

There was much research into compiler optimizations in the late sixties and early seventies. However, optimization algorithms were still described by their implementations. Identifying common links between different transformations was difficult.

A formulation describing optimization techniques in a machine independent manner was developed by Geschke [9]. It expressed the optimizations through the relationships between source language statements. Statement orderings and dependencies are found: e.g. whether statement B comes before statement A in the source text; whether B is needed by A; and whether the order of computing A and B is invariant. Most optimizations of the day could then be specified and implemented in a programming language via these relations [30].

One of the intriguing ideas in the dissertation by Geschke is the *strongly similar subroutine* optimization. It appears promising, but seems to have not been pursued beyond the dissertation. Two sub-trees in the intermediate representation (which is itself a tree) are arguments for a similarity function. This function's return value indicates the desirability of covering the two sub-trees with the code of one subroutine, with low values indicating good space savings potential. Put another way, the value is an indication of the cost of providing one subroutine body that will implement two different portions of the code. Several similarity functions are presented; one takes the parameter costs into consideration; another returns larger values if the differences between the subtrees occur in the middle of code, and low values if the differences are at the beginning or end.

Finding sub-trees that are close matches and suitable for subroutines is not the same as actually turning them into subroutines. As Geschke phrases it, "what is desirable may not be feasible." Procedural abstraction, or more accurately "procedural extraction", must build the subroutine such that space and/or time of the emitted code is lower than if the subroutine was not formed. Even though the dissertation discussed the promise of the approach, the optimization was not included in a subsequent related work [30]. Other

researchers have picked up on the idea of procedural abstraction, but there are still other ways of exploiting code redundancy to save space.

### 2.3 Matching Code Representation to Code Properties

Compiler optimizations and procedural abstraction transform the executable code such that it requires less space to store and/or less time to execute. If the instruction decoding hardware can accept instructions with sizes varying by units of a bit, then there are additional possibilities. An information theoretic approach by Hehner [13] analyzes and recodes executable files based on the statistical properties of a sample of programs. These transformed files are smaller than the originals, but they require special hardware for instruction and data fetches (usually a custom decoder) and/or run-time support (interpretive execution).

Part of the method revolves around finding the fewest number of bits necessary to represent the opcodes. A simple scheme examines a sample program set, finding the frequency of each opcode's use. Frequencies are turned into probabilities, and these in turn may be used to find the smallest possible encodings for the opcodes, such as those provided by Huffman encoding [15]. Frequently occurring opcodes are represented by fewer bits than those which are rarely used. The instructions are re-encoded, a dictionary of encodings attached, and a smaller executable is the result.

However, instructions are usually not randomly distributed through an executable. Analysis of sample programs will show that some instruction pairs appear more frequently than others. In some architectures, one such pair may be a load into a register before an arithmetic operation. An analogy can be taken from the spelling of English words: if the letter "q" occurs in a word, then there is a high probability that a "u" is the next letter. Successors with a high probability are encoded in fewer bits than those with low probability.

Probabilities may be found for  $m$ -tuples of instructions (for some fixed  $m > 1$ ), and used to find minimally sized encodings of the tuples. Re-encoding occurs as before. Taken to its extreme,  $m$  may be the number of instructions in the executable, in which case the whole

program can be represented by one opcode, but the dictionary will have only one entry — the program itself. There would be no space savings. The value of  $m$  must be chosen such that the size of the program's new encoding plus the size of the dictionary of encodings is minimal.

Hehner examines two methods that exploit conditional probabilities between opcodes: iterative pairing and conditional coding.

**Iterative pairing** makes repeated passes over the executable. Candidates are pairs of opcodes that, when replaced by a single new opcode, provide a space savings over and above a limit specified by the user. At each pass, the best compression candidate is found. (The pairs may be generalized to  $m$ -tuples.) When no more candidates exist, iteration stops. Having created an augmented instruction set — the original opcode set along with opcodes for compressed pairs — the algorithm recodes the executable, appending a dictionary of the new opcodes.

A more aggressive algorithm recognizes that the final encoding may represent a local maxima of savings in the solution space of compression. By continuing the pairings where the previous algorithm left off, the compression of the candidates will *increase* code size. However, new candidates with excellent space savings may eventually appear. The final result could be a better compression of the executable than if the search for candidates had stopped in accordance with the original algorithm.

**Conditional coding** (see Figure 2.1) compresses by using  $m$ -tuples, for which conditioning classes of order  $m-1$  conditional probabilities are needed. In the figure, the instruction to be encoded is in the operation register (upper right). Previous instructions are in the context register (upper left), with the register acting as a sliding window over the code. When the data from these two registers are passed to the encoder, the minimum redundancy encoding of the  $m$ -tuple is placed in the code register (bottom) and is output to the new executable file.



Figure 2.1 Conditional coding — encoder/decoder design

Let the probability that  $inst_i$  follows  $inst_{i-(m-1)}$ ,  $inst_{i-(m-2)}$ , ...,  $inst_{i-1}$  be equal to  $p_i$ . The size of code  $c_i$  corresponds to the number of bits required to represent the opcode in the context given, which is [6]:

$$-\log_2 p_i$$

An encoding whose size is equal to this number of bits is a minimal redundancy encoding. The decoding operation is similar to encoding, and the role of registers in decoding is indicated in parentheses in the figure.

Experimental results were obtained by compressing the code of the IBM 360 version of XCOM, a compiler for the XPL programming language. An earlier transformation of the code changed instruction operands from a fixed sized to a variably sized representation. Using iterative pairing, the re-encoding of the XCOM code resulted in a 77.4% reduction in the space required for opcodes. (Data such as addresses and constants were encoded using another space saving algorithm.) Conditional coding using 4-tuples yielded a 80.2% savings in opcode space.

These results do have a cost. It is difficult and expensive to design and implement an instruction decoder for opcodes which vary in size down to a bit. Speed is required; therefore the decoding logic must be implemented in hardware. However, until the late 1970s, when internal memory implementations switched from magnetic core technology to integrated circuits, significant reductions in memory requirements of programs were worth

the cost of increased decoder complexity. As well as being useful for reducing the size of executables, this research also helped computer architecture design. It applied rigour to the selection of instruction sets and opcode sizes, analyzing the properties of programs that are expected to run on a specific machine design.

## 2.4 Compilation to Compact Code

Using PL/I on an IBM System/370 as the experimental platform, Marks examined three code compaction methods [21]. These approaches compress executables by replacing long code sections with short subroutine calls.

*High-level subroutine recognition* identifies redundancy in a programmer's source. The analysis uses the internal representation available immediately after lexing and parsing, with results matched against the original source statements. Each subroutine candidate, which is made up of source language statements, is assessed by the programmer who must then accept or reject it. *Low-level subroutine recognition* identifies repeated sections in the object code where the repetition has been introduced by the actions of the compiler. Compilers tend to produce the same instruction sequences over and over, and the method exploits this. Both the high- and low-level approaches modify some representation of the program, but after code generation, the smaller executables are loaded and run by the computer without any extra processing.

*Tailored interpretation* takes place at the same level as low-level subroutine recognition, but takes advantage of very small sequences rejected as "not profitable" by the low-level scheme. The original object code sequences are replaced by custom opcodes, and the compressed code must be interpreted at run time. This produces very high space savings but at a great cost in speed.

### 2.4.1 High-Level Subroutine Recognition

To improve productivity, some programmers often write code with the assistance of tools such as `lex`, `yacc`, and others that generate source language stubs handling communication or message passing details for some development environments. The

programmer saves time, and the generated source code is guaranteed to match the programmer's specifications. If these tools are used several times during development, the same source code sequences might turn up several times. Programmers also repeat themselves when writing their own code, using idioms, and often within the same procedure. High-level subroutine recognition takes advantage of these types of repetition. In many ways it is easier to compress at this level as we are still some distance away from the complexity of the machine code. Ease in finding candidates comes at a cost, as they tend to be short sequences of source lines, and replacing these sequences by subroutine calls is not very profitable. Nearly identical sequences could be covered by one subroutine, but the exact cost of parameterizing a routine is difficult to calculate because of the distance from the machine code. Figure 2.2, shows the position of the compressor within the compiler.



Figure 2.2 Position of compressor in system — High-level subroutine recognition

**Subroutine recognition:** The PL/I compiler's *n-address* Object-Machine Independent Representation (OMIR) was used by Marks. In its search for candidates, the algorithm treats the *n-address* form as a string of operators and operands. That is, each operator and operand is a "character", and the whole program is one long string. Substrings match if they:

- agree in all their operators (if substrings *A* and *B* belong to the same candidate, then the *i*<sup>th</sup> operator "character" in *A* matches the *i*<sup>th</sup> operator "character" in *B*, for all *i*); and

- differ in at most three other “characters” (this allows for up to three parameters).

```

for each operator Op do
  M[] := empty;
  P1 := first occurrence of Op in the OMRI;
  while there is a position P2 after P1 in the OMRI do
    m := LongestPossibleMatch( P1, P2 );
    add P1, P2 to list M[m];
    P1 := P2;
  end while;
  find the best subroutine candidate B, and query the programmer;
  if programmer accepts B then
    make subroutine based on B;
    replace OMIR instructions matching positions in B;
  end if;
end for;

```

Figure 2.3 Algorithm for High-level subroutine recognition

For each operator in the instruction set, Figure 2.3’s algorithm examines all substrings that begin with that particular operator. Since operators must match, constraining the start of substrings to operator positions is a useful optimization. Pairs of positions are taken together, and the longest possible matches (using the two-part definition from above) that start from these positions are recorded.

**Evaluating candidates:** After searching for all matches, the algorithm finds the best subroutine candidate from the matches that begin with a particular operator. The programmer is allowed to accept or reject this candidate after seeing the line numbers corresponding to the suggested subroutine. If accepted, the subroutine body is added to the end of the program, and all matches in the OMIR are replaced by a subroutine call, each call including the appropriate parameters.

All operators are examined in turn. To keep the algorithm simple, influences between subroutine choices are ignored — the effect of compressing one subroutine before another is not examined. Marks does not indicate how much potential space savings are lost by this design decision.

Any manipulation of the source text is avoided even though rearranging the text may improve the opportunities for compaction. One reason given for this is that it requires flow analysis, but surely this should be simple for  $n$ -address code. A more convincing reason against manipulation is that rearrangement might increase the distance between the source code and the transformed OMIR. The greater the distance, the more difficult it is for the programmer to assess the effect of accepting the subroutine. Marks notices that there are some situations where rearrangements are overkill. For example, two sections may differ where one contains an extra statement in the middle. Splitting the sections into two is the simplest solution, with similar code before and after the statement forming the new subroutines. A set of heuristics such as this could identify and exploit similar situations without the need for data flow analysis.

**Results:** The space savings of this scheme are poor — 1% to 2%. No mention is made of the influence of PL/I, or the intermediate representation, on the results. It is unclear how the results would transfer to languages currently in use, such as C or C++. One suggestion is that the programmer's time may be better spent rewriting or redesigning the code than in accepting or rejecting compaction candidates! At this level, automatic methods of accepting subroutine candidates were not examined.

#### 2.4.2 Low-level subroutine recognition

Intuition leads us to expect more code repetition at the machine level than higher up in the source code. Marks follows this thinking by applying the high-level subroutine recognition algorithm to a version of machine code that he called *low-level code*. Features of this representation include instruction lengths that correspond to the final code lengths. Branch destinations are not resolved, but since the lengths of final instructions are known, estimates of space savings are accurate. Registers have already been allocated, and their reassignment would be expensive. Arguments must be passed into routines via registers, but if all registers are tied up, then subroutine parameterization is expensive. To avoid any register reassignment during subroutines calls, parameters are not allowed.



Figure 2.4 Position of compressor in system — Low-level subroutine recognition

**Subroutine recognition:** The algorithm differs slightly from that in Section 2.4.1. Before finding matches, a hash chain is constructed for each opcode, linking together all positions where that opcode occurs. This reduces the time spent searching through the code. Each hash-chain is examined in turn, with matches performed on every pair of positions from the chain. Only identical code matches are accepted, although extra processing is added to examine "if—then—else" constructs. After a match is made, the pair of opcode positions, the length of the match, and the opcode value itself are formed into a group.

When all hash-chains have been processed, transitive closure merges groups together. After the merge, each group contains all the code positions that start with the same opcode and match to the same length. The groups become subroutine candidates.

**Evaluating candidates:** Groups are placed into a list, and the list is ordered by each group's space savings potential. Those groups corresponding to subroutines which are too short or too infrequent are eliminated as they result in little or no space savings.

Overlapping and nested groups are retained even though some compaction choices become difficult.

To achieve optimal compaction, searching through all the possible orderings of subroutine construction would be necessary. The number of orderings increase if groups are allowed to be broken up. New compaction possibilities appear when additional groups are formed through the splitting of overlapping groups. Deciding whether and which nested routines should be compressed will make analysis even more difficult. A moderately sized program would end up needing a lot of computing power during compaction. Since finding an optimal solution is too costly, a greedy heuristic is used, which, among other things, does not consider the effect of subroutine choices on the speed of the final code.

Subroutine selection begins by taking the group at the head of the list and adding it to a collection of "best groups." Proceeding down the list, groups are added to the "best" collection in the order they appear in the list, but with some constraints. None of the groups in the collection are allowed to be nested, and no groups may overlap. Marks pointed out that it is short sighted to eliminate less profitable groups because they overlap a more profitable group. The algorithm is searching for the best net gain: if rejecting the good group and accepting the overlapping bad groups leads to a better net gain, then this is the decision taken. Different orderings of the original list (best gain to worst; worst gain to best; randomized) do not result in great variations in the final space savings.

The possibility of using parameters in subroutines was raised, but it was suggested that the benefit from this depended on a machine's architecture. If registers are used to pass parameters into subroutines, then these expensive resources are not available to the rest of the code. A result would be poorer overall code quality for the rest of the program. This is reasonable given that register allocation is more difficult when some registers have already been used for parameter passing. Any other parameter passing mechanism would be more space intensive, wiping out some of the benefits of the space savings.

**Results:** Reported compaction ratios were encouraging, with space savings ranging from 5.0% to 20.1%, averaging out at 15%. Compressed code executes 15% slower. Larger programs were better compressed, but the correlation between compaction ratios and low-level code size was non-linear. More efficient instruction matching techniques would

reduce the contribution of compaction to compile times. No numbers were given for the potential loss in compaction if parameters were allowed—it may indeed depend on the machine architecture, but a rough estimate would have been helpful.

#### 2.4.3 Tailored Interpretation

After examining many code examples, Marks observed that groups rejected during low-level compaction could be used, but only if a cheaper subroutine calling mechanism could be found. This is intuitively appealing. We expect many matches that are two or three instructions long, many more than those that are, say, ten instructions long. However, all of the subroutine calling mechanisms require space. As the size of the average subroutine gets smaller, the limit to compaction becomes the size of the calling sequence.

A solution to the calling sequence problem is to *tailor* the instruction set, matching the characteristics of the code. The tailored set is implemented through software. This is similar in spirit to Hehner's "iterative pairing" [13]. Marks acknowledges the influence of the idea, but instead of putting the effort into the hardware encoder/decoder, he puts it into an interpreter.

Opcodes not used by the program take on a new meaning, by becoming indices into a table of the subroutines. The subroutines are unique to each program. Compressed code sizes include the space needed for the compressed code, the interpreter, and its tables.

Figure 2.5 is a schematic of the system as it interprets one instruction.

Compressed code will now be made up of three different types of instructions:

- Ordinary machine instructions executed directly by the hardware with no interpretation.
- Parameterless subroutines.
- Parameterized subroutines.

Arguments for the last instruction type are placed after the opcode, with the interpreter copying the argument value into the subroutine code. As shown in Figure 2.5, R2 in the unparameterized code corresponds to the parameter position in the parameterized routine. The grey data value from the compressed code's instruction is copied into the routine. A

maximum of one parameter per subroutine is allowed, a reasonable requirement considering the small size of the subroutines.

**Subroutine recognition:** The method is taken from low-level compaction, but with the exception that matches may allow one operand position to differ. Nested subroutine calls are also permitted.

**Evaluating Candidates:** The selection of candidates is more complicated because of parameters and nesting, and is implemented through a multi-pass algorithm:

1. A list of groups is created in order of descending space savings potential.
2. Groups are considered in list order, provided that the group (a) has the largest space savings of those lower in the list and (b) does not overlap a selection with a greater space savings. As each group is selected, it is converted into a subroutine. Occurrences in the main program are replaced by a call to the new subroutine using the tailored instruction set.
3. If groups were rejected because of step 2, repeat the algorithm from step 1.

After all possible groups are compressed, the interpreter table will be complete, and the compressed code is saved along with the table and the code for the interpreter itself.

**Results:** At 50%, the compression ratio is impressively high, and is due to the use of parameters and to the utilization of smaller (two or three instruction) matches. Compression in one test case could have exceeded 50%, but the lack of unused opcodes imposed a limit. However, the code may run up to 15 times slower, and is an indication of the cost of interpreting each tailored instruction. Not all instructions depend on the interpreter — operating system calls and noninterpreted subroutines run at full System/370 speed. If space savings are needed, but specific sections of the code must run at full speed, then the "full speed" code could be tagged, and the compiler would not compress the tagged code.

## 2.5 Analyzing and Compressing Assembly Code

In a conference paper by Fraser, Myers and Wendt [8], code is compressed at the same level as Marks' low-level algorithm. The target machine is a PDP-11, and the compressor runs



Figure 2.5 Schematic of mechanism for tailored interpreter.

on a VAX 11/780, but there is no description of the high-level language or its compiler. Compaction occurs at the point after the generation of assembly code and before the assembly. Their technique (abbreviated to "FMW") differs from that of Marks in:

- the algorithm for subroutine recognition, based on the construction of a *suffix tree*;
- and
- the algorithm for identifying and classifying candidates as *open* or *closed*, where the cost of converting the former into subroutines is less than that for the latter.

**Subroutine recognition:** The major computational effort during compaction takes place while finding and evaluating subroutine candidates. Borrowing an idea from text compression, Fraser et al. reduce the search costs by building a suffix tree. While there are many efficient string searching algorithms, repeated substring searches over the same string benefits from an auxiliary index to the main string. The suffix tree is an example of such an index (see Figure 2.6 for an example).

A substring is described by the path from the suffix tree root node to an inner node or to a leaf node. The text of the substring is formed when the path edge labels are concatenated together. If an inner node has  $m$  leaf nodes descendants, then there are  $m$  occurrences of the substring described by the path from the root. Leaf nodes hold the starting positions of these substrings, and there are as many of these nodes as there are characters in the main string.

For example, the suffix tree in Figure 2.6 has been built for the string "ababaabb", which is terminated by the end-of-string symbol "\$". The path of heavy edges from the root node to a leaf corresponds to the substring "abb", which starts at position 6 in the main string. Three heavily circled leaf nodes containing 3, 1 and 6 are the leaf node descendants of the shaded internal node. Notice that the path from the root to the shaded node constructs "ab". This short substring appears three times in the main string, starting at positions 1, 3 and 6. Suffix trees for assembly language program are much larger, and require algorithms that build the index in linear time and reasonable (polynomial) space [19]. As it contains indexes to all substrings, we can also get answers to such questions as, "What is the longest



Figure 2.6 Suffix tree example — “ababaabb\$”

substring that occurs in  $k$  places?” [25] Other researchers have used the suffix tree as the basis for regular text compression [23].

By treating the assembly language text as one string, substrings with several occurrences correspond to potential subroutines. Assembler directives are flushed directly to the compressed file as these must be left unmodified. For the rest of the assembly code, the authors restrict the possible matches in the text: two instructions at different positions are the same if the instruction texts are identical. Instructions that differ by one or more operands are rejected, so this technique does not allow for parameterized subroutines. This restriction is relaxed for branches, allowing for the possibility that different labels may refer to the same destination.

**Evaluating candidates:** After finding the substrings, or “fragments,” each is evaluated for subroutine potential. As with Mark’s low-level subroutine recognition, infrequent or short fragments are discarded. The remaining fragments are grouped into two types. *Open fragments* end with a non-relative branch. When turned into subroutines, they are accessed by an unconditional branch, which is inexpensive. *Closed fragments* exit to the instruction following the fragment. To preserve the original control flow when using closed fragments, only the subroutine call may be used to access its corresponding routine. (See Table 2.5 for examples of open and closed types, where fragments are in boldface.) Because of the extra

instructions needed when converting to a subroutine, closed fragments will not always result in a space savings. An *evaluation function* estimates the space savings potential of a fragment, taking into account these restrictions. Fragments of either type that produce “negative” savings are discarded.

Table 2.5 Closed and open fragments — example

| closed                                                                                                                                       | open                                                                                                                                                       |
|----------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre> ... push D, [sp] ld r1, A ld r2, B mul r3, r2, r1 st r3, B ... push E, [sp] ld r1, A ld r2, B mul r3, r2, r1 mul r4, r4, r3 ... </pre> | <pre> ... st B, r6 push D, [sp] push E, [sp-4] br L5 L1: add r3, r3, r4 ... sub r2, r1, r4 push D, [sp] push E, [sp-4] br L5 L2: add r4, r1, r1 ... </pre> |

A fragment’s control flow *falls off the end* if its last instruction is not an unconditional branch or subroutine return. *Falling through to a fragment* occurs when it is not accessed via a branch or a subroutine call. Since open fragments are accessed by an unconditional jump or a fall through, they cannot exit by falling off the end. Relocating the fragment’s routine will change instruction order as we cannot guarantee that the code following the routine has not changed. Closed fragments are entered by a subroutine call, so falling into them or entering them by a branch will result in incorrect code. Open and closed fragments must contain the destination of any internal relative branch, otherwise the flow of control is not guaranteed to reach the exit of the fragment. Overlapping fragments are discarded. Finally, closed fragments must pop off the stack exactly what they pushed on. Without this final restriction, the return address may not be on the top of the stack at fragment exit (subroutine return).

An optimal algorithm is not used to choose fragments. Instead, a greedy algorithm orders the fragments by the value of the evaluation function, examining them in order of

decreasing space savings. If a fragment satisfies all criteria, it is converted into a subroutine. The code for the body is placed out-of-line. All occurrences of the fragment in the assembly text are replaced by unconditional branches (open fragment) or calls (closed fragment) to the subroutine. The process is repeated until there are no fragments left. Different fragment orderings may result in better compaction, but such orderings were not investigated by the authors.

**Results:** After applying the compaction algorithm to a set of over 150 UNIX utilities, the authors reported space savings ranging from 0% to 39%, averaging at 7%. The compressed binaries took 1 to 5% more CPU time, but this was offset by reduced loading time, and the result was a 11% time savings. However, the wide range of compaction ratios raises questions. Why does some code compress well? Does it matter if programs are large or small? It is unclear how well the figures would hold for instruction sets of architectures currently in use. Much has changed since 1984, both in machine architectures (RISC, superscalar) and compiler technology (instruction scheduling, software pipelining). When compared to the results of Marks's low-level scheme, the average of 7% is poor. This may be due to the size of UNIX utilities: if applied to the PL/I compiler code, the average compaction might improve. How much the FMW algorithm would benefit from parameterized subroutines is unclear.

## 2.6 Looking Forward

There are three parts to the code compaction problem:

- finding subroutine candidates; and
- evaluating subroutine candidates; and
- choosing subroutine candidates.

Greedy algorithms solving the last problem were used in the two previous papers. They were used despite the possibilities for better space savings possible through careful ordering of subroutine construction. The impact that small fragment differences have on compaction rates was ignored. Solutions to the second problem, when applied at the level of assembly code, could not deal with candidates that needed parameters.

The first problem is solved efficiently through the use of suffix trees, and this is what we use to find subroutine candidates. We are interested in seeing the effect on code compaction on a search of the solution space of subroutine parameterizations, which we investigate in Chapters 3 and 4. A Sun 4 SPARC workstation was the platform for our experiments and our algorithm evaluation. UNIX utility binaries were used as input for the compaction algorithms. (The specific UNIX utilities used in the FMW paper were not specified, but that will not rule out comparisons with their results). Since the executable code is disassembled, no compilers need to be modified: all compaction is performed on code that is already loadable and executable. Chapter 5 summarizes the results; it suggests possible improvements and future research directions.

### 3 Compaction and its Problems

*Procedural abstraction* replaces many instructions with a call to one procedure. This achieves *compaction*, which is a transformation of a program into a smaller binary that is immediately loadable and executable. The compacted binary does not need “uncompaction” nor run-time support (e.g. interpreters as in Marks “tailored interpretation”). Since “similar sections of code” is the intuition behind procedural abstraction, we can visually identify repeated appearances of the same code, thereby easily developing transformations leading to space savings. For instance, the left-side program in

| before                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | after                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre> 2020: clr %fp 2024: ld [ %sp + 0x40 ], %o0 2028: add %sp, 0x44, %o1 202c: sll %o0, 2, %o2 2030: add %o2, 4, %o2 2034: add %o1, %o2, %o2 ... 207c: sethi %hi(0x2000), %g1 2080: call 0x16090 2084: ld [%10+0x220], %o0 2088: ld [%10+0x220], %o1 208c: add %o0,%o1,%o0 2090: neg %o1, %o1 2094: ld [%g1 + 0x200], %g1 ... 20e4: or %o3, 2, %o3 20e8: call 0x16090 20ec: ld [%10+0x220], %o0 20f0: ld [%10+0x220], %o1 20f4: add %o0,%o1,%o0 20f8: neg %o1, %o1 20fc: mov %16, %o4 ... 23d4: and %o2, %o1, %o1 23d8: call 0x16090 23dc: ld [%10+0x220], %o0 23e0: ld [%10+0x220], %o1 23e4: add %o0,%o1,%o0 23e8: neg %o1, %o1 23e8: or %17, 0x248, %17 ... </pre> | <pre> 2020: clr %fp 2024: ld [ %sp + 0x40 ], %o0 2028: add %sp, 0x44, %o1 202c: sll %o0, 2, %o2 2030: add %o2, 4, %o2 2034: add %o1, %o2, %o2 ... 207c: sethi %hi(0x2000), %g1 2080: call 0x17000 2084: ld [%g1 + 0x200], %g1 ... 20d4: or %o3, 2, %o3 20d8: call 0x17000 20dc: mov %16, %o4 ... 23b4: and %o2, %o1, %o1 20b8: call 0x17000 23dc: b 0x23e4 ... 17000: call 0x16090 17004: ld [%10+0x220], %o0 17008: ld [%10+0x220], %o1 1700c: add %o0,%o1,%o0 17010: neg %o1, %o1 17014: return </pre> |

Figure 3.1 Compaction example

Figure 3.1 is an abbreviated printout of a disassembled binary, and similar sections of code

are highlighted by boldface type. On the right-hand side, the boldfaced code from the left has been replaced by subroutine calls, and one copy of the body attached to the end of the program.

Space savings are modest: 15 instructions are removed from the original, and 9 are added, resulting in a net savings of 6 instructions. Each of the replaced segments of code are identical and need no parameters. If many such transformations are applied, substantial savings may result.

### 3.1 The Problem

The FMW technique has already handled the "no parameter" case. What happens to compaction if the sections of code are similar but not identical? We have copied some similar sections of code from a program and placed them in Figure 3.2. Their differences are highlighted by boldface type. In the same manner as a high-level language procedure call, we pass in the differing value to the new routine via a *parameter*. Figure 3.3 shows the result of procedural abstraction, including the copying of values into the parameter passing variable. The procedure body has been placed in the region beyond the "end of text" of the original program.

|                         |                         |                         |
|-------------------------|-------------------------|-------------------------|
| 2540: 1d [%00+0x30],%00 | 25f0: 1d [%00+0x30],%00 | 263c: 1d [%00+0x30],%00 |
| 2544: mov 0x9400,%01    | 25f4: mov 0xa000,%01    | 2640: mov 0xa000,%01    |
| 2548: mov \$14,%05      | 25f8: mov \$14,%05      | 2644: mov \$14,%05      |
| 254c: b 0x16210         | 25fc: b 0x16210         | 2648: b 0x16210         |
|                         |                         |                         |
| 25d0: 1d [%00+0x30],%00 | 2610: 1d [%00+0x30],%00 | 2800: 1d [%00+0x30],%00 |
| 25d4: mov 0xa400,%01    | 2614: mov 0xa400,%01    | 2804: mov 0xb000,%01    |
| 25d8: mov \$14,%05      | 2618: mov \$14,%05      | 2808: mov \$14,%05      |
| 25dc: b 0x16210         | 261c: b 0x16210         | 280c: b 0x16210         |

Figure 3.2 Candidate instances — one parameter required

In Figure 3.2 it is easy to find the number and the positions of parameters by simply comparing operand values for all instances, one operand position at a time. If two instances have different values at one position, then a subroutine covering the instances needs a

```

2540: mov 0x9400,%g4      25f0: mov 0xa000,%g4      263c: mov 0xa000,%g4
2544: call etext+0x200     25f4: call etext+0x200     2640: call etext+0x200
25d0: mov 0xa400,%g4      2610: mov 0xa400,%g4      2800: mov 0xb000,%g4
25d4: call etext+0x200     2614: call etext+0x200     2804: call etext+0x200

```

```

etext+0x200: ld [%0+0x30],%00
etext+0x204: mov %g4,%01
etext+0x208: mov %l4,%05
etext+0x208: b 0x16210

```

Figure 3.3 Candidate instances — turned into a subroutine

parameter at that location. Our example above uses one parameter for the *first* operand of the *second* instruction. As more instances are covered by one subroutine body, the compaction rate improves. One way of covering many instances by one body is by passing in the differences to the procedure. In our example, this is done for differences in operand values, but it could also be applied to differences in instruction operations. This thesis will concentrate on the former.

The compaction problem reveals a trade-off between coverage and parameters: if a procedure covers too many instance differences then parameter costs will erase space savings, but allowing no parameters can eliminate many opportunities for compaction. We propose a compromise by reducing the number of parameters by covering the instances with *several* procedure bodies (not just one), mapping instances to procedures in such a way that the number of parameters lead to costs that are tolerable.

For example, covering all five instances in Figure 3.4 by one procedure would require many parameters. However, there is no need to turn all of them into procedure calls. Excluding E, we could group together the others (A, B, C and D) and cover them with one procedure. Operand values that are identical across each instance require only one parameter, thereby reducing the number of instructions needed for parameter copying. Another grouping takes advantage of the fact that instances B and C are identical, and covering these two by a procedure will not require parameters: excluding A, D and E from

```

A
587c: sethi %hi(0xa800), %o7
5880: ld [ %o7 + 0xc8 ], %o7
5884: sll %o5, 2, %o5
5888: st %o0, [ %o7 + %o5 ]
588c: sethi %hi(0xa000), %o0
5890: ldsb [ %o0 + 0x2ec ], %o0
5894: tst %o0

B
5b70: sethi %hi(0xa800), %o3
5b74: ld [ %o3 + 0xc8 ], %o3
5b78: sll %o1, 2, %o1
5b7c: st %o0, [ %o3 + %o1 ]
5b80: sethi %hi(0xa000), %o4
5b84: ldsb [ %o4 + 0x2ec ], %o4
5b88: tst %o4

C
5bec: sethi %hi(0xa800), %o3
5bf0: ld [ %o3 + 0xc8 ], %o3
5bf4: sll %o1, 2, %o1
5bf8: st %o0, [ %o3 + %o1 ]
5bfc: sethi %hi(0xa000), %o4
5c00: ldsb [ %o4 + 0x2ec ], %o4
5c04: tst %o4

D
5ce0: sethi %hi(0xa800), %o7
5ce4: ld [ %o7 + 0xc8 ], %o7
5ce8: sll %o5, 2, %o5
5cec: st %o0, [ %o7 + %o5 ]
5cf0: sethi %hi(0xa000), %o2
5cf4: ldsb [ %o2 + 0x2ec ], %o2
5cf8: tst %o2

E
65d4: sethi %hi(0xa800), %o5
65d8: ld [ %o5 + 0xc8 ], %o5
65dc: sll %o5, 2, %o4
65e0: st %o0, [ %o5 + %o4 ]
65e4: sethi %hi(0xa000), %o7
65e8: ldsb [ %o7 + 0x2ec ], %o7
65ec: tst %o7

```

---

Figure 3.4 Instances supporting several procedure mappings

compaction may or may not produce a space savings. These combinations must be examined and evaluated before making a compaction decision.

Finding the best space saving requires looking beyond exact matches. Through clever grouping of instances and use of procedure parameters, we can outperform the “exact match only” FMW algorithm. Our task is to find these groupings without an exhaustive search. In the next chapter we will present and evaluate different heuristic solutions, but first there are some preliminary details to be covered. The rest of this chapter covers the disassembly and procedure search phases of our compaction scheme.

### 3.2 Preliminaries to Procedure Searching

Compaction through procedural abstraction changes the order of machine instructions within the binary file, but does not alter the result of computation. Marks [21] and Fraser et al. [8] compact the assembly language representation of the program. Many compilers

already translate source code into this representation (e.g. `gcc` and `lcc`), passing it on to an assembler as the last phase of processing. Therefore, compaction could be added as part of a compiler.

In the previous two approaches, compaction can only be performed if the source code is available (i.e. “no source = no compaction.”) A disassembler avoids this restriction. As it not only makes available the assembly code from the binary and processes all of the code linked together as a multi-module system, disassembled instructions could correspond exactly to the code found in the binary. What we see (the binary) should be what we get (the disassembled code).

Disassembly has one drawback: separation of code from data is not always possible. The best we can do is find specific sections of the binary are guaranteed to be code. Symbol table information, data from binary interface headers, and knowledge of compiler conventions provide the precise locations of some instructions. However, it is expensive to look for code beyond the information from these sources. Even with techniques such as abstract execution of code, it is still impossible to guarantee that all instructions have been found [14]. Thankfully, the output from the disassembler using only symbol table information is still quite large. Rather than worry about increasing the coverage of disassembled instructions, a economical solution in this situation is to compact what is guaranteed to be code, and leave the rest unmodified.

### 3.2.1 Disassembly

The disassembler from GNU’s source-level debugger `gdb` was used after several simple modifications [22][24]. `gdb`’s code is very portable, supporting many different hardware platforms, and each supported platform has a corresponding machine description file. Whereas Sun 4 SPARC’s description consists of 1300 non-comment, non-blank lines of code, machines with more complicated instruction sets (e.g. Motorola 68030, Intel 80486) have larger descriptions.

SPARC’s fixed instruction alignment on 4-byte boundaries makes possible a one pass disassembler, and because symbol tables are not absolutely necessary, stripped SPARC

binaries may be compacted. There may still be occasions during disassembly where function entry addresses, data locations and other information from the symbol table would make it easier to distinguish between code and data. On CISC machines with varying instruction sizes, the symbol table is a necessity. There would be little progress without the table as a given section of the binary may be disassembled several ways depending on which address is the start of an instruction. Entry points and function addresses from the table can resolve this type of ambiguity.

Only SPARC binaries were compacted in this project. We assumed that the symbol table information was not available. However, if the binary is not stripped, then function addresses in the symbol table must be updated to reflect the positions of the functions in the smaller binary.

### 3.2.2 Separating Code from Data

The binary code was mapped into memory, and every instruction separately disassembled by a call to `print_insn_sparc()`. Arguments for the procedure include the address of the instruction to be examined. The results returned are the readable text of the instruction and the address of any branch destination. *Illegal* instructions are those which the disassembler does not recognize and are automatically marked off as data. After the binary is disassembled, all basic block information is examined, and sections that do not fall within a basic block are also marked as data.

Some data sections may contain jump tables. Since compaction will change the position and location of code, the table entries may need updating to reflect the addresses of the moved jump targets. Both Sun's C compiler `cc` and GNU's C compiler `gcc` place the table immediately following the jump instruction. Finding the length of the table helps separate code from data. Instructions indicated by `cc` jump table entries precede the code that performs table lookups. For `gcc`, the instructions are located immediately following the last entry in the jump table. However, compilers for other languages may position instructions and tables according to different conventions, so the problem of finding jump table locations may prove difficult. One solution using *program slicing* has been described

in the literature, and does not depend on knowledge of specific compiler implementations. Program slicing finds the subset of instructions which determine the value of the indirect branch register. From this subset, the location of the jump table is usually deducible [18][28]. However, we make the simplifying assumption that the binaries given to the compactor were produced by either `cc` or `gcc`.

Delay slots on RISC architectures add a complication. Some compilers move into this slot the instruction immediately preceding a branch or jump. Therefore a fragment containing a branch or call must keep the branch instruction and its delay slot in the same procedure. However, there are cases where a delay slot instruction is also a branch destination, which usually occurs when two code blocks are fused together by an optimizer (a space saving optimization). If the slot belongs to a branch instruction, and if that branch is the exit from a procedure, then a simple solution to the problem is to make another copy of the delay slot instruction, keeping one copy in the procedure and the other within the original program [5].

### 3.2.3 Difficult Code

Several classes of code are difficult to compact because it may not be possible to guarantee the correctness of their compacted versions.

*Self-modifying code* treats some address contents as both code and data. For example, a *direct dynamic linker* patches in the address of the called subroutine at run-time by modifying the program [16]. Self-modifying schemes usually assume a fixed distance between the instruction to be changed and the instructions performing the change. Moving either instruction will overwrite some other address. Not only is it difficult to recognize self-modifying code, but it will also be hard to ensure that it will work after compaction.

*Arithmetic on branch destinations*, usually performed on indirect branches, requires some data flow analysis for detection. Jump table lookups are easy to recognize, but other arithmetic on registers might not be used until some instructions later. Determining the possible destinations using some data flow analysis and arithmetic is necessary since there is the potential of some destinations located in sections considered data which should now

be viewed as code. So far, no code that has been disassembled has contained indirect branches of this sort, and we assume that this will not change.

Some programs *calculate an address based on the checksum of the instructions*, usually for security purposes — either for copy protection, or to check against tampering of a file during transmission across a network. By definition, these files will not work after the slightest program change, and this rules out their compaction.

### 3.3 Procedure Searching

We have used McCreight's suffix tree algorithm [19] to find procedure possibilities. This algorithm is also the basis for the FMW procedure search, but our point of departure is in the treatment of the program. Where FMW is looking for instances of exact textual matches in the assembly code, we are only looking for instances where *operations* match and allow *operands* to differ between instances.

Some properties of the suffix tree data structure have already been investigated in Section 2.5. Our implementation of the suffix tree algorithm accepts a string of operations as input, and constructs a tree with all possible substrings, and their positions, within the larger string. All substrings and positions are then examined (see next section), and passed to the algorithms that attempt to find the groupings of instances that will produce the best space savings given the substring instances (next chapter).

For example, suppose we are given the following contrived example:

```

add  %sp, 0x44, %o1
sll  %o0, 2, %o2
add  %o2, 4, %o2
add  %o1, %o2, %o2
sethi %hi(0x4000), %o3
save %sp, -152, %sp
call 0x21bc
sll  %o1, 4, %o1
add  %o3, 5, %o3
add  %o1, %o2, %o4
sethi %hi(0x8000), %o2
save %sp, -104, %sp
call 0x22c0
return

```

which yields a string of operations:

| op  | add | sll | add | add | sethi | save | call | sll | add | add | sethi | save | call | return |
|-----|-----|-----|-----|-----|-------|------|------|-----|-----|-----|-------|------|------|--------|
| pos | 0   | 1   | 2   | 3   | 4     | 5    | 6    | 7   | 8   | 9   | 10    | 11   | 12   | 13     |

The suffix tree for this little program can be found in Figure 3.5, and is printed as a prefix traversal of the nodes. Each line of the figure corresponds to a node, and the string of instructions listed on each line is the label on the edge leading into the node. Numbers on the left hand side indicate the level (depth) of the node ("0" indicates "root"), and numbers on the right mark the terminal nodes for each suffix starting at that position in the original string. All other nodes are internal, and are also referred to as "non-terminals".

Our list of substrings uses this last set of numbers (Table 3.1). Concatenating the labels encountered during a downward traversal from the root to any non-terminal forms the substring; all terminals descending from the non-terminal provide the starting positions of the substring. For instance, the substring "sethi save call" starts in two places: string positions 4 and 10. Notice that substrings in one row may be nested within others from a different row. Substring positions may be easily converted into real instruction locations given the fixed sized of SPARC instructions. Later analyses will examine these substrings, and the data gathered during disassembly will be needed.

Table 3.1 Substring table from suffix tree

| substring                   | starting positions |
|-----------------------------|--------------------|
| add sethi save call         | 3, 9               |
| add add sethi save call     | 2, 8               |
| sll add add sethi save call | 1, 7               |
| sethi save call             | 4, 10              |
| save call                   | 5, 11              |
| call                        | 6, 12              |
| add                         | 0, 2, 3, 8, 9      |

The McCreight algorithm has some important properties. It constructs the suffix tree using linear space, and executes in linear time. Each string position, which implies each

```

0:{empty}
1: add
2:   sll add add sethi save call sll add add sethi save call return (0)
2:   sethi save call
3:     sll add add sethi save call return (3)
3:     return (9)
2:   add sethi save call
3:     sll add add sethi save call return (2)
3:     return (8)
1: return (13)
1: call
2:   sll add add sethi save call return (6)
2:   return (12)
1: save call
2:   sll add add sethi save call return (5)
2:   return (11)
1: sethi save call
2:   sll add add sethi save call return (4)
2:   return (10)
1: sll add add sethi save call
2:   sll add add sethi save call return (1)
2:   return (7)

```

Figure 3.5 Suffix tree for example — prefix traversal

suffix (from that string position to the end of the string), adds one terminal node and at most one non-terminal node. Similarly, each suffix adds at most two new edges. Therefore, each step in the algorithm requires constant time. The algorithm also reduces the contribution of scanning edge labels to linear time. Each scan tries to determine where to split an edge and add a non-terminal. Unfortunately, linear time only applies once the edges suitable for scanning have been found. The effort to find the edge when the alphabet has  $n$  (in Figure 3.5,  $n = 6$ ) characters yields a potential  $O(n^2)$  algorithm. A hash table implementation could bring this down to  $O(n \lg n)$ . Therefore the algorithm is not linear, but it does find all substrings, including those nested within others.

### 3.4 Quick Comparison of Different Methods

There are some differences in the three methods (Marks [21], FMW [8], and ours) when examining the strings generated by the procedure search. This examination is necessary before passing the strings on to the instance grouping heuristics. Table 3.2 summarizes the differences between the three methods, with each category described below.

**1 — Different procedure invocation methods:** Using an unconditional branch to transfer control to a procedure is less expensive, in terms of both time and space usage, than using a subroutine call. However, the former method only works if the last instruction of the candidate instance is a branch to the same absolute address in every instance. If control flow proceeds directly to the instruction physically located after the candidate's last, then the subroutine mechanism must be used. Subroutines will behave correctly only if code for the candidate leaves a balanced stack (see the next category). Therefore, the instructions in the procedure determine how it is invoked within the main program.

Invocation methods may be classified by their effect on a compacted program's space and speed. "Low space — high time" categorizes Marks' tailored interpretation, where subroutines are invoked through one instruction, but only with the support of an interpreter. FMW purposely sets out to use a "low space — low time" method, rejecting the use of parameters, thereby requiring only one instruction for the procedure invocation. "High space — low time" describes our approach, which needs extra instructions to copy parameters into registers when invoking a procedure.

**2 — Check for stack balancing:** With most machine architectures, a subroutine call places registers and return addresses onto the user stack. If the code for a candidate does not push the same number of items that it pops, then the stack is returned to the caller in a state different than at call time, and the subroutine code is called "unbalanced". Aggressive balance checking schemes will also try to prune candidates (shortening by removing offending instructions from either end) to remove instructions until the stack is guaranteed to be balanced. Further complications involving stacks include expectations of data at a certain depth from the top of the stack, but pushing on a compacted routine's return address changes that depth. The compacted procedure instructions that use the stack must be changed to reflect the new depth.

**3 — Maximum number of parameters per procedure:** If two instances differ in one or more operand positions, and if both of these instances are to be covered by the same procedure, then parameters will be needed by the procedure. Before the procedure call,

these operand values are copied into parameter passing variables (usually registers) and used by the subroutine. Procedure results may be returned via these parameters, so their values must be copied back into their proper locations before processing continues with the next instruction following the call. We expect to see more "nearly identical" matches than "identical", so parameters appear unavoidable. In the past, some researchers have rejected parameters out of hand as "too expensive" (FMW), while another allows a single parameter (Marks — tailored interpretation). The actual expense of parameterization has not been examined.

Table 3.2 Possibilities during candidate examination

| compaction Method                                     | Marks | FMW | Ours                 |
|-------------------------------------------------------|-------|-----|----------------------|
| 1. Different procedure invocation methods             | no    | yes | yes                  |
| 2. Check for stack balancing                          | yes   | yes | yes                  |
| 3. Maximum number of parameters per created procedure | 1     | 0   | depends on heuristic |

We are most interested in item three. If there are too many parameters, then the cost of their use eliminates any space savings. If there are too few, then many compaction candidates will be rejected. We are interested in determining the minimum number of parameters that will lead to space-savings when compacting a set of instruction instances. A parameterizing heuristic could fix the maximum number of parameters, rejecting all those exceeding this maximum but at the risk of finding no space savings. Conversely, another heuristic could allow as many parameters as needed as long as space savings are the result, but this might produce smaller savings than if the number of parameters were fixed to a small number. These possibilities are investigated in the next chapter.

### 3.5 Summary

We have presented our view of the compaction problem, and have suggested combining parameterization with the grouping of matches into procedures as a way to outperform a scheme prohibiting parameters. Our task is to develop these grouping algorithms such that

parameter costs are minimized and space savings maximized. Some of the preliminary requirements for compacting binaries were reviewed, including a description of those programs which cannot be guaranteed correct after compaction. Finally, we reviewed the substring construction algorithm (suffix tree) that provides the raw material from which to choose procedures. Not all substrings can be used, and we presented a few criteria that must be met. Now it is time to inquire into the benefits of parameters, at the same time examining heuristics which can select procedures such that parameter costs are minimized.

## 4 Searching for Subroutines

Our compaction method for object code (binaries) has two phases:

- converting the binary into a form in which we can locate repeating code sections; and
- examining this code for suitability as procedures, rejecting or transforming them as needed.

The previous chapter covered the first phase and this chapter examines the second. Given a set of instruction strings that are similar to each other, we will now suggest some heuristics that can convert these substrings into space saving procedures even if the strings are not identical. A heuristic approach avoids exhaustive searches of the partitioning solution space. This chapter continues with the introduction of some notation, followed by a quick overview of the criteria which compaction candidates must satisfy. The procedure parameterization problem is presented, and several relations are introduced that will support our heuristic algorithms. Finally, the heuristic algorithms themselves are motivated, presented, and their performance examined.

### 4.1 Definitions

In the past chapters, we have used such terms as “substrings”, “instructions” and “operands” in an informal manner. To reduce ambiguity, we introduce these definitions.

- If  $x$  is an instruction, then define  $|x|$  to mean the number of operands contained in the instruction. If  $x$  is a string, then define  $|x|$  to be the number of instructions in the string.
- Define  $I$  to be an instruction consisting of an operation  $op$  and a (possibly empty) list of  $|I|$  operands,  $\omega_1, \omega_2, \dots, \omega_n$ . For instance, the instruction “sll %00, 2, %02” has  $op = sll$ ,  $|I| = 3$ ,  $\omega_1 = \%00$ ,  $\omega_2 = 2$ , and  $\omega_3 = \%02$ . The “no operation” instruction “nop” is represented by  $op = nop$  and  $|I| = 0$  (i.e. no operands). If there is no ambiguity, the operation for instruction  $I'$  is  $op'$ , and the instruction's  $i^{th}$  operand is  $\omega'_i$ .

- Define **program  $P$**  to be a sequence of  $m$  instructions  $I_1, I_2, \dots, I_m$ . A **substring** starting at position  $i$  and with length  $L$  is denoted by  $I_{i,L}$ .
- We write  $I' \equiv I''$  to denote an **instruction match** between instructions  $I'$  and  $I''$ , where two instructions may have different operand. This is the same as writing:  

$$|I'| = |I''|$$

$$\text{and } op' = op''$$
- An exact match occurs between instruction  $|I'|$  and  $|I''|$ , and is written as  $I' = I''$ , if:  

$$|I'| = |I''|$$

$$\text{and } op' = op''$$

$$\text{and } \omega_j' = \omega_j'' \quad 1 \leq j \leq |I'|$$
- Two substrings,  $I_{i,L}$  and  $I_{j,L}$ , are said to be **instances of a (procedure) candidate** if they do not overlap and if each operation is the same between instances; however, operand values may differ between instances. The number of instances in a candidate  $C$  is equal to  $|C|$ . Using our notation, two substrings are instances if:  

$$\{i, i+1, \dots, i+L-1\} \cap \{j, j+1, \dots, j+L-1\} = \emptyset \quad \text{and}$$

$$I_{i+k} \equiv I_{j+k} \quad 1 \leq k \leq L$$
- Define a **partition component  $\kappa$**  ("component" for short) to be a subset of instance indices from a procedure candidate (i.e.  $\kappa \subseteq \{1, \dots, |C|\}$ ). Each component is associated with a procedure body, and each instance in the component may be replaced in the original program by a branch or call to the associated procedure. There will be cases where a candidate instance will be excluded from any procedure, and left as found in the original program, because of the cost of the instance's inclusion. These excluded instances make up  $\kappa_e$ .
- Define a **partition  $\Pi$**  of the procedure candidate  $C$  to be a group of disjoint components. Each component corresponds to one procedure body, and component members are instances that are replaced by branches or calls to the procedures. If the partitioning does not cover all instances, then those excluded are left as found in the original object file.

Using the notation for a partition, we define a partitioning  $\Pi$  containing  $n$  partitions as  $\Pi = \{\kappa_0, \kappa_1, \kappa_2, \dots, \kappa_n\}$ , with the partitions satisfying the following conditions:

$$\begin{aligned} i \neq j \Rightarrow \kappa_i \cap \kappa_j &= \emptyset & \text{and} \\ \kappa_0 \cap \kappa_i &= \emptyset \quad 1 \leq i \leq n & \text{and} \\ \kappa_0 \cup \left( \bigcup_{i=1}^n \kappa_i \right) &= \{1, \dots, |C|\} \end{aligned}$$

## 4.2 Visualizing Operands

Thinking about the parameterization problem is difficult when the number of candidate instances is large. An easier method for visualizing the relationship of operand values between instances is by converting the candidate instances into an array of operand values. Each row of the array will correspond to a candidate instance, and a column contains the values of operands at one specific position across all instances. Operand text is replaced by an ordinal value (i.e. “v1” for “value 1”). Our implementation leaves out the instruction operations from the array, as these are guaranteed (as per our restriction that instructions must be candidate matches) to be the same between instances. However, as an aid to connecting array rows with candidates instances, the instruction operations will be shown in our figures in shaded columns. After substituting ordinal values for operand text from Figure 3.2 of Section 3.1, we have an array representation in (Figure 4.1). Row and column indices are provided to improve readability. Scanning down each of the eight operand

| inst | operands |    |    |    |     |     |    |     |    |    |
|------|----------|----|----|----|-----|-----|----|-----|----|----|
|      | 1        | 2  | 3  | 4  | 5   | 6   | 7  | 8   |    |    |
| I1   | 1d       | v1 | v2 | v1 | mov | v3  | v4 | mov | v5 | v6 |
| I2   | 1d       | v1 | v2 | v1 | mov | v8  | v4 | mov | v5 | v6 |
| I3   | 1d       | v1 | v2 | v1 | mov | v9  | v4 | mov | v5 | v6 |
| I4   | 1d       | v1 | v2 | v1 | mov | v8  | v4 | mov | v5 | v6 |
| I5   | 1d       | v1 | v2 | v1 | mov | v9  | v4 | mov | v5 | v6 |
| I6   | 1d       | v1 | v2 | v1 | mov | v10 | v4 | mov | v5 | v6 |

Figure 4.1 One parameter example —  $\Omega$ -array form

columns, we see that one parameter is needed for the operands in column 4, and this corresponds to the first operand of the "mov" instruction. Since all other operands have the same values across all instances, they do not require parameterization.

We call this two dimensional array the  $\Omega$ -array. Instances of the same candidate, made up of  $n$  substrings, are transformed into an array containing only the ordinal values for operand values. Each row  $\Omega_i$  contains the equivalent of the  $\omega$  values for the instruction operands in  $i^{th}$  substring starting in position  $p$  of the program (call it  $I_{p,L}$ ).

Operand values are positioned in the row in the same order in which they appear within instructions and between instructions. That is, if  $\Omega_{i,j} = \omega'$ ,  $\Omega_{i,k} = \omega''$ , and  $\omega'$  appears before  $\omega''$  in the  $I_{p,L}$  (either because  $\omega'$  is in an earlier instruction, or appears earlier in the same instruction) then  $j < k$  must hold. Since we know how many operands may be found in each of the substring's instructions (i.e.  $|I_p|, |I_{p+1}|, \dots, |I_{p+L-1}|$ ), any parameter choice based on the analysis of the  $\Omega$ -array can be related back to the instance's instructions and their associated operands.

As stated earlier, we require that instruction operations to be identical with each other between substrings. A more flexible scheme relaxes this restriction, allowing differences in operations between instances. Given this relaxation, Geschke's "strongly similar subroutine" optimization [9] could be implemented. Parameterizing for operation and operand differences, with the parameter value is used to select a compacted procedure statement from the set of possibilities, could produce procedures that cover very long strings. However, this would add another dimension to the parameterization problem, and even more space would be required both at the procedure call site and in the procedure body itself for parameters. Future work could examine the space savings when allowing for instruction operation and operand mismatches. However, we will concentrate exclusively on operand mismatches.

### 4.3 $\Omega$ -array Partitions

#### 4.3.1 Motivation

So far we have visually identified the positions of parameters. However, suppose we must parameterize something moderately large, as in Figure 4.2. What partitioning produces the

| Inst | operands |    |    |     |     |    |     |    |    |    |    |     |
|------|----------|----|----|-----|-----|----|-----|----|----|----|----|-----|
|      | 1        | 2  | 3  | 4   | 5   | 6  | 7   | 8  | 9  | 10 | 11 | 12  |
| i1   | ld       | v1 | v2 | v3  | mov | v4 | v5  | v6 | ld | v7 | v6 | v3  |
| i2   | ld       | v3 | v9 | v10 | mov | v9 | v11 | v3 | ld | v3 | v9 | v12 |
| i3   | ld       | v3 | v9 | v10 | mov | v9 | v11 | v3 | ld | v3 | v9 | v12 |
| i4   | ld       | v3 | v4 | v14 | mov | v4 | v5  | v3 | ld | v3 | v9 | v15 |
| i5   | ld       | v3 | v4 | v15 | mov | v9 | v17 | v3 | ld | v3 | v9 | v18 |
| i6   | ld       | v3 | v4 | v20 | mov | v9 | v21 | v3 | ld | v3 | v9 | v22 |
| i7   | ld       | v3 | v4 | v8  | mov | v9 | v24 | v3 | ld | v3 | v9 | v25 |

Figure 4.2 Complicated partitioning —  $\Omega$ -array

best compaction? If instances *i1* and *i4* are placed in  $\kappa_e$  (i.e. left uncompressed), then the other five instances could be covered by one procedure using 5 parameters. If they are not excluded, then the procedure would need 12 parameters, and space savings would be very unlikely. Before finding the minimum number of partitions, we must find out how many operand values are identical in each column. This is a scan that proceeds up and down (i.e. vertically) each of the columns of the  $\Omega$ -array, *partitioning* the set of instances into components with identical operand values.

Another (somewhat contrived) example is shown in Figure 4.3. Scanning down each column reveals that there are no operand matches between instances. However, only one operand value appears in each instance. A procedure covering all instances need only receive one parameter, and all operand positions are *tied* together by the same value. We have reduced the number of parameters from 7 to 1. As it is possible to reduce the cost of parameters by tying together column positions sharing the same value within an instance, we scan each row of the  $\Omega$ -array from left to right (i.e. horizontally), *partitioning* the set of operand positions into components with identical operand values.

| Instance | operands |    |    |    |     |    |     |     |    |    |    |    |
|----------|----------|----|----|----|-----|----|-----|-----|----|----|----|----|
|          | 1        | 2  | 3  | 4  | 5   | 6  | 7   | 8   | 9  | 10 | 11 |    |
| i1       | ld       | v1 | v1 | v1 | tst | v1 | nop | sll | v1 | v1 | b  | v1 |
| i2       | ld       | v2 | v2 | v2 | tst | v2 | nop | sll | v2 | v2 | b  | v2 |
| i3       | ld       | v3 | v3 | v3 | tst | v3 | nop | sll | v3 | v3 | b  | v3 |
| i4       | ld       | v4 | v4 | v4 | tst | v4 | nop | sll | v4 | v4 | b  | v4 |
| i5       | ld       | v5 | v5 | v5 | tst | v5 | nop | sll | v5 | v5 | b  | v5 |
| i6       | ld       | v6 | v6 | v6 | tst | v6 | nop | sll | v6 | v6 | b  | v6 |

Figure 4.3 Tieing example —  $\Omega$ -array

An exhaustive search of the partitioning solution space will always find the parameterization producing the best space savings, but this is not efficient. A heuristic algorithm still seems attractive, and we could use the components of vertical and horizontal partitions to guide our search. Vertical partition components appearing many times in array columns will suggest candidate groupings producing procedures with fewer parameters than those components which appear only a few times. Similarly, horizontal partition components containing a large number of operand positions and many instance appearances will reduce parameterization by eliminating the duplication of parameter values.

#### 4.3.2 Constructing Vertical Partitions

Vertical partitions span all instances of a candidate, and the members of each partition component are row indices from the  $\Omega$ -array. These members have the same operand value in a specific column (operand position) and therefore are grouped together. Values such as a vertical component's size (number of instances), and the number of times in which the component appears in a partitioning for a column (number of columns) are used by the heuristics. As these two numbers grow larger, the corresponding component becomes a desirable guide for grouping instances into procedures.

Our main interest is in any vertical partition's components. Each component is uniquely determined by the  $\Omega$ -array rows that it covers (instances), and this "name" is

stored in another array called  $\Gamma$ . This name refers to a set of array *row* indices. Each entry of  $\Gamma$  has a corresponding entry in the array  $G$ . Elements of  $G$  are a set of  $\Omega$ -array *columns* (operand positions) in which the component appears as part of the vertical partitioning. Therefore,  $\Gamma$  and  $G$  are arrays of sets, with each set of rows ( $\Gamma_z$ ) associated with a set of columns for which the instances have the same  $\omega$ -values ( $G_z$ ).

An algorithm to construct the vertical partitions and store the components, given an  $\Omega$ -array, is shown in Figure 4.4, and the results can be found in Table 4.1. After initializing

```

procedure buildVerticalPartitions( $\Omega$ , VAR  $\Gamma$ ; VAR  $G$ )
1  Initialize all elements of arrays  $\Gamma_z, G_z$  to  $\emptyset$ 
2  for  $j = 1$  to numOperands /* number of operands in instance */
3     $T = \{1, 2, \dots, |\Omega|\}$ 
4    while  $T \neq \emptyset$  do
5      select any  $i \in T$ 
6       $U := \{i\}$ 
7      for each  $i$  such that  $\Omega_{i,j} = \Omega_{i,j}$ 
8         $U := U \cup i$ 
9      find  $z$  such that  $\Gamma_z = U$ 
10     if not found, construct such  $\Gamma_z$  at next unused value of  $z$ 
11      $G_z := G_z \cup j$ 
12      $T := T - i$ 
13  return  $\Gamma, G$ 

```

Figure 4.4      Algorithm 2 — Finding Vertical Partition Components

all of the elements (line 1) in  $\Gamma$  and  $G$ , the algorithm examines each operand column of the  $\Omega$ -array in turn (line 2). All instances with the same  $\omega$ -value in this column are grouped together, one unique value at a time (lines 3 — 12), forming the vertical partition of column  $j$ . Each component ( $U$ ) is used to find a  $z$  such that  $\Gamma_z = U$  (line 9). If this is the first time that a component equal to  $U$  has appeared for this  $\Omega$ -array, then  $U$  is added to the  $\Gamma$ -array in the next open position (line 10). The column index  $j$  is added to  $G_z$ , indicating that the component described by  $\Gamma_z$  appears in the vertical partitioning for this column. All

components for the column are found before proceeding to the next column (lines 3, 5, 12). After processing all columns, the two arrays of sets are returned to the caller.

Running this algorithm on the  $\Omega$ -array of Figure 4.3 produces the components and component appearances shown in Table 4.1. By definition, all  $\Gamma_z$  values for an operand

Table 4.1 Vertical components for candidate in Figure 4.3

| z  | $\Gamma_z$<br>(components) | $G_z$ (component appearances) |                            |
|----|----------------------------|-------------------------------|----------------------------|
|    |                            | before closure                | after closure              |
| 1  | i1                         | 1,2,3,6,7,8,9,11,12           | 1,2,3,6,7,8,9,11,12,4,10,5 |
| 2  | i2,i3,i4,i5,i6,i7          | 1,6,7,8                       | 1,6,7,8                    |
| 3  | i2,i3                      | 2,3,5,9,11                    | 2,3,5,9,11,7,8,4,10,1,6,12 |
| 4  | i4,i5,i6,i7                | 2                             | 2,1,6,7,8                  |
| 5  | i4                         | 3,9,11                        | 3,9,11,1,2,4,5,6,7,8,10,12 |
| 6  | i5                         | 3,5,9,11                      | 3,5,9,11,1,2,4,6,7,8,10,12 |
| 7  | i6                         | 3,5,9,11                      | 3,5,9,11,1,2,4,6,7,8,10,12 |
| 8  | i7                         | 3,5,9,11,12                   | 3,5,9,11,1,2,4,6,7,8,10,12 |
| 9  | i1,i4                      | 4,5,10                        | 4,5,10                     |
| 10 | i2,i3,i5,i6,i7             | 4,10                          | 4,10,1,6,7,8               |
| 11 | i2,i3,i4,i5,i6             | 12                            | 12,1,6,7,8                 |

column are disjoint.  $\Gamma$  array entries may be subsets of other components in the  $\Gamma$ -array. For instance,  $\Gamma_2$  is {i2, i3, i4, i5, i6, i7}, and this component can be found in the partitioning of columns 1, 6, 7 and 8. Therefore, we expect that instances from any  $\Gamma$ -array entry which is a subset of  $\Gamma_4$  will have the same value in these columns. After performing a closure operation, the  $G$ -array entries corresponding to {i2, i3}, {i4, i5, i6, i7}, {i4}, {i5}, {i6}, {i7}, {i2, i3, i5, i6, i7}, and {i2, i3, i4, i5, i6} will contain columns 1, 6, 7 and 8 after closure. The closure operation can be applied to the  $G$ -array elements either during or after partition component construction.

#### 4.3.3 Constructing Horizontal Partitions

Vertical partitions suggest candidate groupings through components, as each component is a possible grouping of instances to a procedure. Since the instances of the component  $\Gamma_z$

have the same operand values in the columns contained in  $G_z$ , the instances may be covered with a procedure that does not need parameters for the columns in  $G_z$ . The remaining columns ( $\{1, \dots, numColumns\} - G_z$ ) do require parameters, and the values to be passed in could be tied together, and this results in a requirement for fewer parameters.

For each instance ( $\Omega$ -array row), operand positions ( $\Omega$ -array columns) with the same value are partitioned together. Components with a single member are operand values that appear only once within the instance (singletons), and are ignored when reporting ties. Table 4.2 contains the horizontal partitionings for the  $\Omega$ -array of Figure 4.2. Their construction is similar to the process for vertical partitions.

Table 4.2 Horizontal partitions for example in Figure 4.2

| instance | horizontal partition                                |
|----------|-----------------------------------------------------|
| i1       | {1} {2} {3, 9} {4, 10} {5} {6,8,12} {7}{11}         |
| i2       | {1, 6, 7, 12} {2, 4, 8, 10} {3} {5} {9} {11}        |
| i3       | {1, 6, 7, 12} {2, 4, 8, 10} {3} {5} {9} {11}        |
| i4       | {1, 6, 7, 12} {2, 4, 8, 10} {3} {5} {9} {11}        |
| i5       | {1, 6, 7, 12} {2, 8} {4, 10} {3} {5} {9} {11}       |
| i6       | {1, 6, 7, 12} {2, 8} {4, 10} {3} {5} {9} {11}       |
| i7       | {1, 6, 7} {2, 8} {4, 10} {3} {5} {9} {10} {11} {12} |

Given the horizontal partitions for each instance, we wish to find the ties that are valid for all instances of a vertical partition component. In our notation, capital letters are horizontal partitions, and lower case letters are their components. Given two partitions satisfying our definition in Section 4.1:

$$Q = \{q_1, q_2, \dots, q_n\} \quad R = \{r_1, r_2, \dots, r_m\}$$

where each of the components  $q$  and  $r$  are sets of column indices, a partition that is valid for both of them is called  $T$ , and defined as:

$$T = \{q \cap r | q \in Q, r \in R\}$$

We write the operation implementing this relation as:

$$T = Q \cap R$$

As an example, we take one possible grouping of instances from Figure 4.2 — {i3, i5, i7}. The following horizontal sets are taken from Table 4.2. We wish to know what ties will be valid for a procedure covering all instances of this partition. Given:

|     |               |               |         |         |     |      |
|-----|---------------|---------------|---------|---------|-----|------|
| i3: | {1, 6, 7, 12} | {2, 4, 8, 10} | {3}     | {5}     | {9} | {11} |
| i5: | {1, 6, 7, 12} | {2, 8}        | {4, 10} | {3}     | {5} | {9}  |
| i7: | {1, 6, 7}     | {12}          | {2, 8}  | {4, 10} | {3} | {5}  |

then the computation of  $T$  (where  $T$  happens to coincide with the value for i7) produces:

|    |           |      |        |         |     |     |     |      |
|----|-----------|------|--------|---------|-----|-----|-----|------|
| T: | {1, 6, 7} | {12} | {2, 8} | {4, 10} | {3} | {5} | {9} | {11} |
|----|-----------|------|--------|---------|-----|-----|-----|------|

Using Venn diagrams, one can see the “ $\cap$ ” operation applied to the different horizontal partition components in the computation of  $i3 \cap i7$  (Figure 4.5). Each of the intersections becomes a component of the tied partition. All columns in an intersection by themselves are operand positions that cannot safely share a parameter with any another position. For instance, i3 has the same operand values in columns 1, 6, 7, and 12; i7 does not include column 12 as part of the set. Therefore any tieing applied to both i3 and i7 must give column 12 its own parameter, otherwise the tie will copy in an incorrect value into operand 12 for instance i7.



Figure 4.5 Finding a valid tieing partition

Horizontal partitions are needed to find parameter ties, but parameter ties provide no clue as to which candidate groupings save the most space: parameter tieing makes a good

situation better. However, even a good tie partition will not reduce the number of parameters if the operand columns it covers do not need parameters.

```

/* The horizontal partitions  $H$  for the candidate and the
 * candidate partitioning component  $K$ , also known as a
 * candidate grouping, are passed in.
 * Valid ties for all instances are returned in  $T$ .
 */
procedure findValidTies(  $H$  ,  $K$  , VAR  $T$  )

  select any  $i \in K$ ;
   $K' := K - i$ ;
   $T = H_i$ ;
  for each  $i$  such that  $i \in K'$  do
     $T = T \cap H_i$ 
  return  $T$ 

```

Figure 4.6 Algorithm 3 — Find tieing valid for partition instances

With parameter ties valid for all component instances, we can calculate the cost of any mapping of candidate instances to procedures. Finding the components leading to the best space savings is the difficult part, and we look at this next.

#### 4.4 Heuristics

The main motivation for using vertical and horizontal partitions is to provide a starting point for exploration of the procedure/parameter solution space. A brute force approach which examines each possible partitioning is expensive because it is combinatorially explosive. Suppose that the candidate has been converted into an  $\Omega$ -array. If the number of rows in the array is  $m$ , and the number of partitions over the candidate is  $n$  and allowed to vary over  $1 \leq n \leq m$ ), then the number of unordered partitionings on  $m$  is:

$$\sum_{m=1}^n S(m, n) \quad (\text{EQ 4.1})$$

where  $S(m, n)$  is a Stirling number of the second kind [26]:

$$S(m, n) = \frac{\sum_{k=0}^{n-1} (-1)^k C(n, k) (n-k)^m}{n!} \quad \text{where} \quad C(n, k) = \frac{n!}{k! (n-k)!}$$

Even reasonably sized candidates can produce large numbers of combinations, and our heuristics regularly meet up with 10 to 15 instances in a candidate (see Table 4.3). Casting the question "how should the candidate instances be partitioned" into a known optimization problem also appears to be difficult.

Table 4.3 Possible number of partitionings

| number of instances (m) | number of unorderd partitionings (Eq. 1) | number of instances (m) | number of unorderd partitionings (Eq. 1) |
|-------------------------|------------------------------------------|-------------------------|------------------------------------------|
| 1                       | 1                                        | 8                       | 4140                                     |
| 2                       | 2                                        | 9                       | 21147                                    |
| 3                       | 4                                        | 10                      | 115,875                                  |
| 4                       | 15                                       | 15                      | 1,382,958,545                            |
| 5                       | 52                                       | 20                      | $-5.174 \times 10^{13}$                  |
| 6                       | 203                                      | 25                      | $-4.638 \times 10^{18}$                  |
| 7                       | 877                                      | 30                      | $-8.467 \times 10^{23}$                  |

In this section we do not aim for optimal compaction. Instead we attempt to find space saving opportunities that other compaction algorithms (Marks, FMW) pass over. For this more modest goal, we may appeal to heuristics. Our basic rule of thumb is that a grouping of instances to a procedure should (a) cover as many instances as possible while (b) resulting in as few parameters as possible. Requirement (a) maximizes the ratio of released space to procedure body space, and (b) improves the space savings at the original instance positions.

#### 4.4.1 Calculating Space Savings

Each heuristic is evaluated based on the space savings potential of its generated groupings. Computing space savings on architectures with variably sized instructions (e.g. Intel 80x86, Motorola 680x0) must take into account the size of the operands. However, SPARC instructions have a fixed size, which simplifies the space savings calculation. Parameter costs are also slightly reduced on the SPARC machines because all branch and call instructions are paired together with an extra instruction called the *delay slot*. This slot forces a compile-time decision as to which instruction is fed into the processor pipeline

when there is a change in control flow during program execution. We use this slot to copy a value into a parameter variable during a procedure invocation. If there are no parameters, then the first instruction of the procedure may be placed in the slot.

There are five parts to the calculation:

$S_u$  — *savings upper bound*: No scheme can do better than that which removes all instances from the code. All of the following costs are subtracted from this upper bound.

$C_1$  — *parameters copied at procedure entry*: A procedure call (closed subroutine) or branch (open routine) may use the delay slot to copy a parameter, so there is no cost for the first parameter. After the first, all other parameters cost one instruction.

$C_2$  — *parameters copied at procedure exit*: One instruction per parameter is needed. These copy the results from parameter registers to the original variable locations.

$C_3$  — *space needed for procedure body*: Procedures are almost the same size as a candidate instance. Two extra instructions are needed if the candidate is in closed form (subroutine) for the “return” instruction and its delay slot. No extra instructions are needed if the procedure is in open form.

$C_4$  — *branch or call instruction*: As mentioned in the description of  $C_1$ , one instruction is needed for the transfer of control to the procedure. A second is for the delay slot.

Each component of the candidate partition  $\Pi$  (except  $\kappa_e$ ) requires one procedure, and therefore contributes a list of costs. Different parameterizations for different partitions mean different values for  $C_1$ ,  $C_2$  and  $C_3$ . As instances of a candidate use the same procedure invocation mechanism,  $C_4$  remains fixed. In the equation,  $c_{i,j}$  is the cost  $c_i$  associated with parameterization of  $\kappa_j$ . For  $n$  components, the space saving is:

$$S_{\text{total}} = S_u - \sum_{j=1}^n (C_{1,j} + C_{2,j} + C_{3,j} + C_4) \quad (\text{EQ 4.2})$$

Equation 4.2 is the basis for evaluation of the partitioning heuristics presented in the rest of this chapter. After finding all the space savings for a binary given the advice of a specific

heuristic, the total space savings for the whole file are divided by the original binary size, multiplied by 100, and reported as the “% space savings”. There are four categories based on the number of lines that are disassembled as code: (a) <2K instructions; (b) 2K — 5K instructions; (c) 5K — 10K instructions; and (d) >10K instructions. Since we are investigating the performance of the heuristics, it is important to focus on how well they work, as distinct from how well the disassembler separates code from data. Improving disassembler code coverage is important, but as mentioned earlier in the thesis, if we are only given the binary file and its symbol table, 100% separation of code from data is expensive.

For simplicity of analysis, there is no nesting or overlapping of procedures. Our results were compared against the FMW results (in Section 4.4.6) if the number of instructions saved was greater than 50 and the space savings at least 1%. It *does* appear that the order in which candidates are evaluated by the heuristics affects the final result. We tried three different orderings: (a) by maximum potential space savings; (b) by the number of instances in the candidate; and (c) by length of an instance in the candidate (number of instructions). The best space savings of the three orderings are used for our reports. Out of 200 UNIX utilities, 174 produced a space savings greater than 1 instruction, and 67 produced a space savings greater than 50 instructions.

#### 4.4.2 FMW Compaction (Base Line)

A big surprise was the poor performance of the FMW algorithm. Although the 1984 paper [8] reported space savings ranging from 0 to 39% with an average of 7%, our implementation of FWM on the SPARC produced far poorer results. This could be due to several factors: the SPARC register set’s richness makes identical matches less frequent. The paper applied compaction to assembly code where, by definition, all the code and data is separated. FMW may be affected by presence of string constants within the code. This may explain the order of magnitude difference in the paper’s results and our own implementation’s results. However, this does not rule out comparison with our heuristics

— the substrings FMW finds will always be found by our algorithm. Results for the FMW scheme are in Table 4.4.

Table 4.4 FMW compaction — % savings

| instructions | average | min | max |
|--------------|---------|-----|-----|
| 0 — 2K       | 1.5     | 1.1 | 1.7 |
| 2K — 5K      | 2.3     | 1.0 | 5.4 |
| 5K — 10K     | 2.0     | 1.0 | 3.9 |
| 10K +        | 2.3     | 1.1 | 6.1 |

#### 4.4.3 Heuristic 1

This scheme attempts to find one procedure by maximizing the number of instances in a partition component  $|\Gamma_z|$  and maximizing the number of operand columns not needing a parameter. Larger operand column sets ( $|G_z|$ ) indicate that fewer parameters are needed for a partitioning based on  $\Gamma_z$ . All instances not included in the procedure are left as found in the original binary. A pseudocode algorithm appears in Figure 4.6.

```

procedure heuristic1(  $\Gamma$ ,  $G$ , VAR  $\kappa$  )

max := 0;
maxSet := -1;
 $\forall z$  do
    a :=  $|\Gamma_z|$ ; /* number of instances */
    b :=  $|G_z|$ ; /* number of operand positions */
    if ( a * b) > max then
         $\kappa$  :=  $\Gamma_z$ ;
        max := a * b;
return  $\kappa$ ;

```

Figure 4.7 Algorithm 4 — partitioning heuristic 1

The algorithm is greedy as it finds the partition that covers the largest area of the  $\Omega$ -array. If the disassembled binary contains substrings that heavily favor some operand values over others, then the space savings suggested by the heuristic are high. Results are poor if the number of instances making up the returned component is large while the number of operands making up each instance is small. That is, we find better space savings

if we have candidates with long instances (many instructions), but fewer of them in the component. One major drawback of heuristic 1 is that after making a good compaction choice, all excluded instances are completely ignored (i.e. placed in  $\kappa_e$ ) and are not re-examined for further compaction opportunities. Experimental results are in Table 4.5.

Table 4.5 Heuristic 1 compaction — % savings

| instructions | average | min | max |
|--------------|---------|-----|-----|
| 0 — 2K       | 2.0     | 1.1 | 3.2 |
| 2K — 5K      | 2.2     | 1.1 | 4.8 |
| 5K — 10K     | 2.2     | 1.0 | 5.3 |
| 10K +        | 2.6     | 1.0 | 8.5 |

#### 4.4.4 Heuristic 2

The drawbacks of the previous heuristic suggest another method. Partitionings of the candidates should contain more than two components (in heuristic 1, we have only  $\{\kappa_e, \kappa_1\}$ ). As we are interested in all those components (groups of instances) leading to procedures with potential for space savings, only those containing three or more instances are examined. All vertical components are sorted by decreasing number of columns ( $|G_z|$ ) — components with the greatest numbers of parameter free columns are examined first. The secondary sort field is the length of the component itself ( $|\Gamma_z|$ ) — the number of instances. Components  $\kappa$  are added to the partitioning  $\Pi$  if they are *instance disjoint* with those already in  $\Pi$  (i.e. an instance may not be replaced by two procedures). After examining the sorted components, all instances excluded from the selected partitions will remain as found in the original executable. Note that the excluded instances can support no other space saving partitions, and this is an improvement from heuristic 1. The algorithm is found in Figure 4.8.

This algorithm is also greedy, but it is allowed to progress further than heuristic 1. It is not significantly better (see Table 4.6). As the size of the program grows, it begins to perform worse than heuristic 1, and this goes against our intuition. After all, heuristic 2 should contain all the space savings of the previous scheme. This is not the case, though,

```

procedure heuristic2( G,  $\Gamma$ , VAR  $\Pi$ )
  for each  $z = 1 \dots |\Gamma|$  do
    if  $|\Gamma_z| < 3$  then
       $\Gamma_z, G_z := \emptyset$ ; /* discard set */
     $G' := \text{sort ( } G, \text{ primary key decreasing } |G|,$ 
           secondary key decreasing  $|\Gamma| \text{ );}$ 
     $\Gamma' := \Gamma \text{ sorted in the same order as } G';$ 
     $\Pi := \emptyset$ ; /* start with empty partitioning */
    for each  $z = 1 \dots |G'|$  do /* look at each set in sorted order */
       $s := \Gamma'_z$ ; /* s is a set of instances (rows) */
      if  $(\exists t \in s \text{ and } \exists k \in \Pi) \text{ and } (t \cap k = t)$  then
        ignore  $s$ ; /* set contains instance already in  $\Pi$  */
      else
         $\Pi = \Pi \cup s$ ; /* add set to partitioning */
      return  $\Pi$ ; /* return the partitioning */
  
```

Figure 4.8 Algorithm 5 — partitioning heuristic 2

as 2 places the highest priority on components with the fewest parameters, as opposed to heuristic 1 which tries to find the largest possible space savings in one procedure. Even though a heuristic 1 procedure may have several parameters, if it has a sufficiently large number of instances (and each instance is made up of many instructions), then better space savings would result with this rejected partition than with many smaller heuristic 2 procedures that use fewer parameters.

Table 4.6 Heuristic 2 compaction — % savings

| instructions | average | min | max |
|--------------|---------|-----|-----|
| 0 — 2K       | 2.0     | 1.2 | 3.0 |
| 2K — 5K      | 2.3     | 1.0 | 5.7 |
| 5K — 10K     | 2.3     | 1.0 | 4.9 |
| 10K +        | 2.1     | 1.0 | 7.3 |

#### 4.4.5 Heuristic 3

One point of view holds that parameters are too expensive: we are willing to pay at most for one or two. Heuristic 3 limits partitions to a maximum number of parameters. The method for this scheme (algorithm 6) may be found in Figure 4.9. It is difficult to detect an

```

procedure heuristic3( maxParms, G,  $\Gamma$ , VAR  $\Pi$  )

  c := 1;
  for each  $z$  do
    n := numParms(  $G_z$ ,  $\Gamma_z$  );
    if (  $n \leq maxParms$  ) then
       $N_c$  := n;
       $G'_c$  :=  $G_z$ ;
       $\Gamma'_c$  :=  $\Gamma_z$ ;
      c := c + 1;

   $G''$  := sort (  $G'$ , primary key  $|\Gamma_i|$ , secondary key decreasing  $|N_i|$  );
   $\Gamma''$  :=  $\Gamma'$  sorted in the same order as  $G''$ ;

  /* Algorithm is now similar to latter half of Heuristic #2 */

   $\Pi$  :=  $\emptyset$ ;           /* start with empty partitioning */
  for each  $c = 1 \dots |G''|$  do /* look at each set in sorted order */
    s :=  $\Gamma''_c$ ;           /* s is a set of instances (rows) */
    if (  $\exists i \in s$  and  $\exists k \in \Pi$  ) and (  $i \cap k = i$  ) then
      ignore s;           /* set contains instance already in  $\Pi$  */
    else
       $\Pi$  =  $\Pi \cup s$ ;       /* add set to partitioning */
  return  $\Pi$ ;           /* return the partitioning */

```

Figure 4.9 Algorithm 6—partitioning heuristic 3

improvement over the previous heuristics for small files, and some people may find the numbers insignificant. We have limited the number of parameters to one, and compaction does not appear to have suffered. The results of this heuristic are in Table 4.7.

A flexible algorithm could tune the maximum number of allowable parameters to the size of the candidates — shorter candidates would not be allowed to have any parameters.

Table 4.7 Heuristic 3 compaction — % savings

| instructions | average | min | max |
|--------------|---------|-----|-----|
| 0 — 2K       | 2.0     | 1.2 | 3.0 |
| 2K — 5K      | 1.7     | 1.0 | 3.9 |
| 5K — 10K     | 2.0     | 1.1 | 4.6 |
| 10K +        | 2.2     | 1.1 | 7.0 |

while candidates spanning many instructions could produce large enough space savings to sustain a bigger parameter cost. The longer the candidate, the greater the probability that there will be mismatches between operand values in the same column.

#### 4.4.6 “Best of heuristic”, and percentage improvement over FMW

The “best of heuristic” is the result of running each heuristic on a candidate, and then choosing the partitioning/procedures with the best space savings suggested by any heuristic. In many cases the best space savings for a binary are suggested by only one of the heuristics, but frequently the “best of heuristic” results for a binary are better than the results of any one of the three (see Table 4.8). This suggests that better results will come from compaction schemes that are tuned to several candidates, i.e. use one heuristic for candidates with many short instances, use another for candidates with fewer but longer instances, etc.

Table 4.8 Best of heuristic — % savings

| instructions | average | min | max |
|--------------|---------|-----|-----|
| 0 — 2K       | 2.0     | 1.1 | 3.2 |
| 2K — 5K      | 2.4     | 1.1 | 6.4 |
| 5K — 10K     | 2.5     | 1.1 | 6.6 |
| 10K +        | 2.7     | 1.1 | 9.3 |

Our final results appear worse if compared with the reported space savings in [8] (4% average versus our 2.7% average). However, as mentioned at the start of this section, our implementation of the FMW algorithm revealed that its performance was quite poor when

working with disassembled SPARC code. A more interesting comparison is made on a binary-by-binary basis, and the FMW results are compared with our own "Best of Heuristic". We have calculated the percentage space savings improvement of our scheme over FMW and averaged out these results for each of the program size categories. As programs become longer, our algorithm outperforms FMW even more. The large variances point out that occasionally a program is encountered with plenty of redundancy. Overall, however, these results show that better space savings may be achieved if we use parameters to change similar strings of instructions into one procedure.

Table 4.9 "Best of heuristic" improvement over FMW

| instructions | average (%) | min (%) | max (%) |
|--------------|-------------|---------|---------|
| 2K — 5K      | 49.1        | 13.9    | 115.4   |
| 5K — 10K     | 60.8        | 6.8     | 173.6   |
| 10K +        | 84.4        | 17.1    | 566.7   |

## 4.5 Further transformations that could be pursued

The heuristics given above do not transform the candidates themselves, but only pick and choose among candidates. More aggressive schemes should seek to transform the candidate instances such that compaction opportunities are improved.

### 4.5.1 Dropping instructions from instances

If the parameters are necessary because of differences in the instructions at the beginning or end of candidate instances, then these instruction could be dropped. An increase in the potential space savings results since the cost of adding parameters for the offending instruction is significantly higher than the space saved by including the instruction in the procedure.

### 4.5.2 Wasp waisting

This is a variation of the previous technique, but the instruction causing the parameter is now in the middle of the candidate. To remedy this, candidates are split in two, and covered